Revision as of 21:55, 27 February 2014 by DanBolser (Talk | contribs)



Here is a space for more details and project coordination. Please use it.

Email 01

As I see it there are two main strands: a) create a data model for the "environment/trait/SNP/propensity/citation/opinion" tuples, and b) put a user-friendly wiki-style interface on top.
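
To make strand (a) a little more concrete, here is a minimal sketch of how one such tuple might be modelled in Python. Everything in it is an assumption for illustration only: the class names `Association` and `Opinion`, the field types, the +1/-1 voting scheme and the placeholder values (including the SNP identifier) do not come from the project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Opinion:
    """One contributor's view of an association (hypothetical +1/-1 vote)."""
    contributor: str
    vote: int  # +1 = supports the association, -1 = disputes it


@dataclass
class Association:
    """One environment/trait/SNP/propensity/citation/opinion tuple."""
    environment: str   # condition under which the association holds
    trait: str         # phenotype affected
    snp: str           # marker identifier
    propensity: float  # effect size on an arbitrary, assumed scale
    citation: str      # source of the claim
    opinions: List[Opinion] = field(default_factory=list)

    def consensus(self) -> int:
        """Net support: the sum of all votes (an assumed, very crude measure)."""
        return sum(o.vote for o in self.opinions)


# Placeholder values only; none of this is real data.
example = Association(
    environment="example environment",
    trait="example trait",
    snp="rs0000000",
    propensity=1.5,
    citation="placeholder citation",
)
example.opinions.append(Opinion("alice", 1))
example.opinions.append(Opinion("bob", 1))
example.opinions.append(Opinion("carol", -1))
print(example.consensus())
```

In Semantic MediaWiki the same fields would presumably become properties on a wiki page rather than attributes of a class.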

Sadly, I really can't commit that much mentoring time to the project (sorry, but I'm just being realistic). So the main requirement would be for you to be able to work independently, with perhaps only a few hours of high-level guidance per week from me. However, you're welcome to work on this project with anyone.

If you are interested, you could try playing with Semantic MediaWiki and Semantic Forms, which I see as two useful tools for this project:

Email 02

You're absolutely right, genes (or specifically, the proteins they encode) are the functional workhorses of biology, doing the 'work' of the cell and the body. The biological differences between individuals are often attributable to changes in proteins or their regulation (such as changes in their expression level). These differences are evident at the DNA level, as DNA encodes the proteins and their regulatory elements. Now the interesting thing is that specific SNPs often serve as markers for particular blocks of DNA (called haplotypes) - when 'shuffling the deck' of human DNA, there are fewer cards than you imagine, because DNA is only mixed in these large blocks. So... by measuring SNPs, you can predict a lot of biology by association.

This principle underlies the science of 'genome-wide association studies' (GWAS). If you like, I can put together some reading on this topic for you, as it's something I'd like to understand better too!

Email 03

BTW, does the idea make sense or should I try to clarify it? Personally I'd like one database of 'risk factors' including genetic and environmental factors ... that way I can see how much a risk like 'drinking alcohol increases risk of cancer' outweighs my 'low risk of colon cancer'. I figure one way to motivate people to contribute to such a DB is to score them personally and rank the 'best' ;-)

Email 04

> I would like to know a bit more about what you are looking for (in
> IDEA#10: The genome game: crowd-sourcing better crops), specifically:
>  1) Whether you require the ability to import/export databases

I think there will be an initial import of data from resources such as
23andMe, dbSNP, 1k and 10k genomes projects, PGP, SNPedia, 'GET
Evidence' and dbGap. Probably others can be considered.

I can't think of a good reason not to allow export of our data.

>  2) Certain variations tend to occur together (even if not affecting
> the same characteristic); so should the rank of one influence the rank
> of its friends?

Yes. This is the basis of most SNP associations. I.e. the genome chip
interrogates a 'tag' SNP that is correlated with a particular
'characteristic causing' mutation (SNP, CNV, InDel, etc.). Information
about these 'haploblocks' can be obtained from the 1k genomes project,
but it's not crucial TBH. The fact that SNP x is associated on average
with increased propensity y is enough information for our purposes.

>  3) Does the database have to be seeded initially with information or
> is just developing the interface enough (as I don't think it would be
> possible for me to furnish data to seed the database in a 3 month time-
> frame)

This is where the 'game mechanics' need to be considered. No one is
going to contribute their genome to an empty database, so to get
people to contribute, we need to seed the system with some well
studied, predictive and interesting associations. E.g. breast cancer,
diabetes, obesity, smoking, exercise, diet... Once we get a good set,
it becomes interesting for people to contribute their data to see how
they rank in the system. Hopefully, (the idea is that...) people will
then start to add associations from studies that are relevant to them,
trying to improve their score or trying to create the 'best'
individual possible within the system.

It's harder if we focus on crops, because a) there are fewer resources
available, b) phenotype data is harder to come by, and c) fewer people
are interested in crops.

>  4) What kind of access controls to the system are you looking for?

Good question... I was thinking that we would just run it like
Wikipedia... a few carefully chosen admins who can kick people around
a bit, but mostly let anyone do anything... The dream would be some
form of computational argumentation augmented consensus, which isn't
unimaginable, but is optimistic.

I thought about a 'contributor score' as well as a 'genome' and
'lifestyle' score (nudging people towards logging in), but I'm not
sure, it's marginal.


I have to say that (unfortunately) I won't be able to promise you much
time as a mentor on this project. If you like the idea, I strongly
suggest you try to find additional mentors (I'm really sorry about

Here is a link to the 1k genomes project, where the data files can also
be found (sorry, "1k genomes" is probably not the 'canonical' term for
the project).

Here is the PGP (10k genomes):

Not sure if they have released data yet though :-(

23andMe will be a screen scrape job.

I don't know anything about GET Evidence yet.

Other sources could include, for example,

Email 05

> 1. Does the proposal presented there mean that those selected would be
> trying to develop a website cum software where researchers involved in
> genome studies could post their researched genomes and define the type of
> environment best for that particular genome?

Yup! And vice versa, the best genome for a particular environment.

> 2. And then any individual who knows something about his/her genome could
> check on that site what type of environment is best for him/her.

Yes. I envision people uploading their personal genomes (and lifestyle
characteristics) and 'playing the game' to see where they rank in
terms of genomic fitness (and environmental fitness).


Cheers Vibhore, will you be able to work on this regardless of
Google's decision? I think it would be cool to get a group of
interested students together to work on this collaboratively.

Email 06


I think idea 10 is twofold:

1) Create a database system within which markers for desirable or
undesirable genetic traits (discovered by GWAS or other published
studies) can be objectively scored with respect to certain
environmental conditions.

For example, when designing a potato for high-yield, disease-free
growth in an arid country with no frost, which combination of genetic
markers scores best? The database system will answer this query using
scored data about marker-trait associations (from studies) and
environmental 'rules' that condition those scores.
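
As a rough illustration of the potato query, the sketch below scores every combination of marker alleles under a set of environmental weights and returns the best one. It is only a toy under stated assumptions: the marker names, allele names, trait effects and the `arid_no_frost` weights are all invented, and treating an environmental 'rule' as a simple per-trait multiplier is one possible reading, not the project's design.

```python
from itertools import product

# Hypothetical marker-trait association scores: marker -> allele -> {trait: effect}.
ASSOCIATIONS = {
    "marker_A": {"a1": {"yield": 2.0}, "a2": {"frost_tolerance": 3.0}},
    "marker_B": {"b1": {"drought_tolerance": 1.5}, "b2": {"disease_resistance": 1.0}},
}

# Hypothetical environmental 'rules': how much each trait matters in an environment.
ENVIRONMENT_RULES = {
    "arid_no_frost": {
        "yield": 1.0,
        "disease_resistance": 1.0,
        "drought_tolerance": 2.0,
        "frost_tolerance": 0.0,  # frost tolerance is irrelevant where there is no frost
    },
}


def score(genotype, environment):
    """Sum each allele's trait effects, weighted by the environment's rules."""
    weights = ENVIRONMENT_RULES[environment]
    total = 0.0
    for marker, allele in genotype.items():
        for trait, effect in ASSOCIATIONS[marker][allele].items():
            total += weights.get(trait, 0.0) * effect
    return total


def best_genotype(environment):
    """Brute-force search over all allele combinations (fine for a toy example)."""
    markers = sorted(ASSOCIATIONS)
    allele_choices = [sorted(ASSOCIATIONS[m]) for m in markers]
    candidates = (dict(zip(markers, alleles)) for alleles in product(*allele_choices))
    return max(candidates, key=lambda g: score(g, environment))


best = best_genotype("arid_no_frost")
print(best, score(best, "arid_no_frost"))
```

With real data the number of combinations grows exponentially in the number of markers, so an exhaustive search like this would need replacing with something smarter.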

2) Gather data for such a system by 'crowd sourcing' knowledge and
literature information from experts in the form of a game. The game
sets the challenge of designing the best or worst individual given the
current data in the system, and motivates players to contribute to the
data in the form of a structured wiki (such as Semantic MediaWiki).

For example, given that I'm a smoking, non-drinking vegetarian who
takes 2-3 hours of exercise a day, and given that I have a relatively
low genetic risk of cancer but a higher-than-average risk of venous
thromboembolism [1], how do I rank relative to the other players in
the game? What lifestyle or genetic changes boost my score?
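
The ranking question could be sketched along these lines, assuming (purely for illustration) that each lifestyle or genetic factor carries a single additive score. The factor names and numbers below are made up and are not real effect sizes; `FACTOR_SCORES`, `player_score`, `rank` and `best_single_change` are hypothetical names.

```python
# Hypothetical additive scores per factor (positive = beneficial). Not real data.
FACTOR_SCORES = {
    "smoker": -3.0,
    "non_drinker": 1.0,
    "vegetarian": 0.5,
    "daily_exercise": 2.0,
    "low_cancer_risk": 1.5,
    "high_vte_risk": -1.0,
}


def player_score(factors):
    """Total score for a player: the sum of the scores of their factors."""
    return sum(FACTOR_SCORES.get(f, 0.0) for f in factors)


def rank(players):
    """Return player names ordered from highest to lowest score."""
    return sorted(players, key=lambda name: player_score(players[name]), reverse=True)


def best_single_change(factors):
    """Which one factor, if removed, would raise the player's score the most?"""
    return max(factors, key=lambda f: -FACTOR_SCORES.get(f, 0.0))


players = {
    "me": {"smoker", "non_drinker", "vegetarian", "daily_exercise",
           "low_cancer_risk", "high_vte_risk"},
    "other": {"daily_exercise"},
}

print(rank(players))
print(best_single_change(players["me"]))
```

A real system would have to let factors interact (the environmental 'rules' conditioning the genetic scores) rather than simply adding them up.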

Similarly, what is the consensus on the health benefits of
vegetarianism, and what sources are cited there? (And how is this
'argument' and consensus computationally encoded in the system?)