GMOD’s November 2007 meeting was held November 5, 1:30PM to November 7, 12:00PM at Cold Spring Harbor Laboratory following the Genome Informatics meeting.
A list of suggested topics was raised in advance by GMOD community members.
There was a $25 registration fee to cover meals and other costs associated with the meeting. Please contact Scott Cain cain@cshl.edu if you need a receipt for your payment.
The meeting was held at Cold Spring Harbor Laboratory’s Woodbury building, which is not on the main CSHL campus.
We spent some time on our first day discussing what topics attendees would like to discuss. This list of topics helped shape the meeting agenda.
1:00 Shuttle from Grace Auditorium to Woodbury
1:30 Introductions
2:30 Coffee
5:30 Shuttle from Woodbury to Grace Auditorium
8:50 Shuttle from Grace Auditorium to Woodbury
9:15 ? Scott
10:15 Community Annotation
10:30 Coffee
12:00 Lunch
1:00 Standards and applications for storing comparative genome data
2:30 Coffee
5:30 Shuttle from Woodbury to Grace Auditorium
8:50 Shuttle from Grace Auditorium to Woodbury
9:15 BioPerl
10:30 Coffee
12:00 Shuttle from Woodbury to Grace Auditorium
The minutes here are based on Dave Clements’ notes from the meeting. They are far from complete and you are encouraged to expand and correct them.
The minutes are not chronological. Rather they are broken up into 3 sections:
We had several discussions about the big picture.
Don Gilbert pointed out that cheap short-read sequencers are now available. Lots of people have inexpensive sequences, but there is still no way to do cheap annotation.
Current GMOD clients are species or family centered. Want to make it easy to integrate multiple species. ApiDB is at the point of opening new species databases and web sites with relatively little effort.
Comparative genomics came up over and over again, both across species and within species.
As data grows and is consolidated, issues of who owns the data and who’s responsible for the annotation become more problematic.
How does GMOD want to deal with integration issues?
How close to the sequencer does GMOD want to get? We don’t want to pull the data off the sequencer.
Should we position GMOD as something that can feed data into places like Ensembl? Ensembl does not have the curation expertise of the MODs. Even if NCBI is wonderful at consolidation, they won’t have quality curation. GMOD sits right there, supporting curation. So, we doubt that Ensembl or NCBI will swallow us whole.
We need to figure out what components we want and what we are pushing. If we focus on a core set of packages then life gets easier for the project.
There was discussion of better release management for components, and the VMWare Community Annotation Server package. Are GMOD bundles the way of the future? The belief was that binary packages are generally not going to work for GMOD unless someone is willing to put a lot of time into maintaining them.
Comparative genomics came up over and over again, both across species and within species. The GBrowse_syn talk in particular spawned a discussion on this.
First, can Chado represent relationships that have more than two members? Yes. The featureloc table has a rank column. Do we want collections in Chado?
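The rank idea can be illustrated with a minimal sketch. It assumes a heavily simplified stand-in for Chado's feature and featureloc tables (only a handful of the real columns), uses SQLite rather than Postgres so that it is self-contained, and all feature names and coordinates are invented for illustration. One feature representing a syntenic block gets one featureloc row per participating genome, and rank tells the rows apart.

```python
import sqlite3

# Simplified stand-ins for Chado's feature and featureloc tables.
# Real Chado tables have more columns; this subset is an assumption
# made to keep the sketch short.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    strand        INTEGER,
    locgroup      INTEGER NOT NULL DEFAULT 0,
    rank          INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, locgroup, rank)
);
""")

# Hypothetical data: three chromosomes from three species, plus one
# feature that stands for a region shared by all three.
conn.executemany("INSERT INTO feature VALUES (?, ?)", [
    (1, "chrA_species1"),
    (2, "chrB_species2"),
    (3, "chrC_species3"),
    (10, "syntenic_block_1"),
])

# One featureloc row per member of the relationship, distinguished by rank.
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand, rank) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (10, 1, 1000, 5000, 1, 0),
        (10, 2, 200, 4100, 1, 1),
        (10, 3, 7000, 11200, -1, 2),
    ],
)

def members(conn, feature_id):
    """Return (rank, source name, fmin, fmax, strand) for every location of a feature."""
    return conn.execute(
        """SELECT fl.rank, src.uniquename, fl.fmin, fl.fmax, fl.strand
           FROM featureloc fl
           JOIN feature src ON src.feature_id = fl.srcfeature_id
           WHERE fl.feature_id = ?
           ORDER BY fl.rank""",
        (feature_id,),
    ).fetchall()

block_members = members(conn, 10)
for row in block_members:
    print(row)
```

Whether rank alone is a good long-term convention for comparative data is exactly the open question raised in the discussion; the sketch only shows that more than two locations per feature can be stored.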
Jason suggested a working group on how to do this. Dave from UMD volunteered to manage a wiki page on this, with the end goal of establishing a document that defines how to store comparative genomes.
Talks on synteny are spread throughout this document.
Work has resumed on developing Apollo. Ed Lee formerly of TIGR/JCVI started working for Suzi Lewis at Berkeley this fall and is working on it. Work is being done on
Apollo can talk directly to a database or it can use XML files instead. FlyBase, VectorBase, BeeBase, and BovineBase are all believed to take the XML approach.
Apollo currently has two choices for database adaptors:
The trigger version is used in the Community Annotation Server and on the Dolan-Rice project. We could not think of anywhere else it was used. The triggerless version is used everywhere else that we knew of.
The trigger version is Postgres specific. The triggerless version stores multiple copies of shared exons.
Notes from Tuesday: Decided to actively discourage use of the trigger version. Best thing may be to go through trigger code and externalize the logic.
Notes from Wednesday: Apollo - Chado - No short term decision. Long term probably move to Crabtree.
As you may have noticed, those notes disagree.
There was a discussion of BioPerl and how it relates to GMOD.
Jason Stajich created a slimmed down feature Perl package based on arrays instead of hashes: Bio::SeqFeature::Slim. This is 70% faster for reading a GFF file. Bio::FeatureIO only supports GFF3. It is slow, uses heavy objects, and is strongly typed. Jason wants to spend more time on middleware speed. He also wants a converter into a common object model and code to get it back out to any supported format.
6 to 8 people are currently contributing to BioPerl.
GFF3 has an ID field. ID is not clear in earlier versions. GFF2 supports arbitrary feature types. GFF3 requires SO types (but you can always ignore that). Keep detailed alignment data in a separate database, not in GFF3. Indicate in GFF3 that data is stored elsewhere. Could store cigar strings in GFF3 and spec supports that.
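As a rough illustration of the last point, the GFF3 specification defines a Gap attribute whose value is a CIGAR-like string. The sketch below parses one alignment line modelled on the example in the specification; the helper names parse_gff3_attributes and parse_gap are hypothetical and not part of BioPerl or any GMOD tool.

```python
import re
from urllib.parse import unquote

def parse_gff3_attributes(column9):
    """Split the GFF3 attributes column into a dict of tag -> list of values."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma separated; literal commas are URL-escaped.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs

def parse_gap(gap):
    """Turn a Gap value such as 'M8 D3 M6' into a list of (operation, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MIDFR])(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognized Gap operation: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops

# Alignment line modelled on the GFF3 specification's Gap example.
line = "chr3\t.\tMatch\t1\t23\t.\t.\t.\tID=Match1;Target=EST23 1 21;Gap=M8 D3 M6 I1 M6"
fields = line.split("\t")
attrs = parse_gff3_attributes(fields[8])
gap_ops = parse_gap(attrs["Gap"][0])

# M and D consume reference bases; M and I consume target bases.
reference_span = sum(n for op, n in gap_ops if op in "MD")
target_span = sum(n for op, n in gap_ops if op in "MI")

print(gap_ops)
print(reference_span, target_span)
```

The two spans can be checked against the start and end columns and the Target coordinates on the same line, which is a cheap sanity check when loading alignments.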
There was a request to make Chado more database neutral, rather than Postgres-specific.
The slowness of Chado databases came up in several contexts. David from UMD Medical Center started a Postgres performance page on the wiki.
Scott described a potential way to implement materialized views in Chado that gets us most of the benefits of DBMS-supported materialized views. Store
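The minutes do not record the details of Scott's proposal. Purely as a generic illustration, one common way to approximate a materialized view without DBMS support is to store the result of an expensive query in an ordinary table and rebuild that table on demand. The sketch below assumes simplified stand-ins for Chado tables, uses SQLite so that it is self-contained, and the names refresh_materialized_view and gene_summary are hypothetical.

```python
import sqlite3

# The query whose result we want to cache (an assumption for illustration).
GENE_SUMMARY_SQL = """
SELECT f.feature_id, f.name, fl.srcfeature_id, fl.fmin, fl.fmax
FROM feature f
JOIN featureloc fl ON fl.feature_id = f.feature_id
WHERE f.type = 'gene'
"""

def refresh_materialized_view(conn, table_name, select_sql):
    """Rebuild an ordinary table from a query, emulating a materialized view.

    table_name and select_sql are assumed to be trusted constants,
    since they are interpolated directly into the SQL text.
    """
    with conn:
        conn.execute(f"DROP TABLE IF EXISTS {table_name}")
        conn.execute(f"CREATE TABLE {table_name} AS {select_sql}")
        conn.execute(
            f"CREATE INDEX {table_name}_src_idx ON {table_name} (srcfeature_id, fmin)"
        )

def count_rows(conn, table_name):
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER, fmin INTEGER, fmax INTEGER);
INSERT INTO feature VALUES (1, 'chr1', 'chromosome'), (2, 'geneA', 'gene'), (3, 'geneB', 'gene');
INSERT INTO featureloc VALUES (2, 1, 100, 900), (3, 1, 1500, 2400);
""")

refresh_materialized_view(conn, "gene_summary", GENE_SUMMARY_SQL)
count_initial = count_rows(conn, "gene_summary")

# New data arrives in the base tables; the cached table is now stale.
conn.execute("INSERT INTO feature VALUES (4, 'geneC', 'gene')")
conn.execute("INSERT INTO featureloc VALUES (4, 1, 3000, 3800)")
count_stale = count_rows(conn, "gene_summary")

# An explicit refresh brings it back in sync.
refresh_materialized_view(conn, "gene_summary", GENE_SUMMARY_SQL)
count_refreshed = count_rows(conn, "gene_summary")

print(count_initial, count_stale, count_refreshed)
```

The trade-off is the usual one: reads against the cached table are fast, but the table is only as current as the last refresh, so a refresh has to be scheduled or triggered after loads.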
Question was raised if genome metadata fits into the current Chado. The belief was that it does not.
Jason Stajich wants a better idea of who is responsible for what in terms of Chado modules. Dave C will take this on.
The table level and column level documentation for Chado is in a good state. Enhanced basic, big picture documentation was requested. Josh Goodman is thinking of providing a mapping from Chado DB columns to FlyBase report columns. Mike Caudy pointed out we should have multiple examples of implementation, not just FlyBase.
We discussed if a Chado database validator would be worthwhile. A validator would check a Chado database to see if it conforms to the canonical model for a Chado database. There was no consensus on the value or practicality of this. There was consensus that no one was willing to volunteer to write it.
Ben suggested that if and when we do this, we use the GFF3 to Chado validator as a starting point.
There was a request to make Chado more database neutral, rather than Postgres-specific. Someone also asked if there was an SQLite adapter for GBrowse.
Slow performance of Chado Postgres implementations came up repeatedly.
Some bits:
Presentation: CMap Progress Report, Ben Faga
New CMap release (1.0) is on its way. Will have an assembly editor. Includes a dot plot, new glyphs, and an install script based on the GBrowse install script.
Ben will ask users to do beta testing, and hopes to start with that before the end of 2007. Ben is looking for a project that is doing large scale assembly, to test CMap for doing assembly correction.
This was a popular motif in the meeting.
Presentation: Community Annotation, Linda Sperling
Linda Sperling discussed ParameciumDB. Paramecium is a small community with few resources and no dedicated curators.
Paramecium curators are a small set of people that must do their annotation from fixed IP addresses. Curator annotations are kept in addition to existing Genoscope predictions. These annotations are not validated when they are submitted. Annotators cannot change annotations made by other people. There are two databases: one backing the website, and one where annotation goes. Once a month the new annotation is pushed to the web site. Validation happens prior to release.
They are also using ParameciumDB to teach annotation at two colleges, and some annotation comes from that. The bulk of annotations come from 2 curators, with the other curators all making a small number of annotations.
Uses Java WebStart version of Apollo. Annotators click on link and Apollo starts up. Apollo talks directly to Chado, using the triggerless database adapter.
Don Gilbert briefly described community annotation at JGI. They have a web interface for simple annotations and use Apollo for complex annotations. Anyone can promote any gene model, but they can’t delete other models. Use the Wikipedia model: Whoever annotates last is correct.
Lukas Mueller discussed SGN.
SGN has data for tomato, potato, eggplant, and many other species. SGN is locus centric. Each locus has (or can have) a single person who is the editor/owner of that locus. The locus editor can change anything about that locus that they want. The name of the locus editor is displayed on the locus page. Every locus has a “request editor privileges” link, whether or not that locus has been assigned.
All edits are logged, and nothing is ever truly deleted. ‘Deleted’ items are retained but flagged as obsolete and are no longer shown.
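The "flag as obsolete, never delete" pattern can be sketched as follows. This is not SGN's actual schema: the locus and edit_log tables, their columns, and the function names are all assumptions made for illustration, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables; SGN's real schema is not described in these minutes.
CREATE TABLE locus (
    locus_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    obsolete INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE edit_log (
    log_id   INTEGER PRIMARY KEY,
    locus_id INTEGER NOT NULL REFERENCES locus(locus_id),
    editor   TEXT NOT NULL,
    action   TEXT NOT NULL
);
""")

def add_locus(conn, name, editor):
    """Create a locus and record who created it."""
    cur = conn.execute("INSERT INTO locus (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT INTO edit_log (locus_id, editor, action) VALUES (?, ?, 'create')",
        (cur.lastrowid, editor),
    )
    return cur.lastrowid

def delete_locus(conn, locus_id, editor):
    """'Delete' a locus by flagging it obsolete; the row itself is kept."""
    conn.execute("UPDATE locus SET obsolete = 1 WHERE locus_id = ?", (locus_id,))
    conn.execute(
        "INSERT INTO edit_log (locus_id, editor, action) VALUES (?, ?, 'obsolete')",
        (locus_id, editor),
    )

def visible_loci(conn):
    """Names of loci that should still be shown to users."""
    return [row[0] for row in conn.execute(
        "SELECT name FROM locus WHERE obsolete = 0 ORDER BY name"
    )]

locus_a = add_locus(conn, "locus_a", "alice")
locus_b = add_locus(conn, "locus_b", "bob")
delete_locus(conn, locus_b, "bob")

shown = visible_loci(conn)
stored = conn.execute("SELECT COUNT(*) FROM locus").fetchone()[0]
logged = conn.execute("SELECT COUNT(*) FROM edit_log").fetchone()[0]
print(shown, stored, logged)
```

Because nothing is physically removed, an obsolete locus can be restored by clearing the flag, and the log still shows who made each change.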
SGN supports tagging of loci. Tags are free text that are rationalized after they are created. The tagging metaphor for curation also came up in several contexts during the Genome Informatics meeting.
Scott Cain spoke about this. It is almost ready to go. The Community Annotation Server (CAS) is meant to be “GMOD in a box”. Currently it consists of:
Note that it does not include Turnkey and/or GMODWeb. Lincoln would like to add GMODWeb, Textpresso and BioMart to that list.
This can run on any Intel machine, including Apple. Very little performance hit is caused by virtualization.
An online trial version of the Community Annotation Server was requested and was already on the way.
Gregg Helt attended with the goal of bringing the Distributed Annotation System, version 2 (DAS/2) into the GMOD family.
Preserving DAS/1 Strengths in DAS/2
Allen Day built a DAS/2 server on top of Chado. That is in CVS.
There is a validation suite for server responses to different queries.
Spec has not changed in over a year.
Scott would like it if, when someone installs Chado, they also get BioMart and DAS/2. That is, they get access by default. Gregg would like to see GBrowse get a DAS/2 adapter.
Lincoln Stein talked about upcoming releases of GBrowse.
Version 3.0 (now called JBrowse) is a fork of the code, and versions 2 and 3 are expected to co-exist ‘forever’. Some shops won’t have the horsepower to power version 3, and Lincoln wants to keep it as an easy to install tool.
Chado is usually too slow to run GBrowse on top of. Consider using Bio::DB::GFF instead. (Can’t run GBrowse on top of BioMart. No adapter exists because of BioMart’s flexible schema.)
Jason S argues that GBrowse slows down when it does BioPerl object creation. These are relatively heavyweight objects. He has just written a Slim version that is up to 70% faster.
Browser speed was also the number one issue (with all browsers) at the Genome Browsers Birds-of-a-Feather meeting at Genome Informatics.
Presentation: GMOD Indiana update slides, Don Gilbert
Don Gilbert spoke about Genome grid.
Genome Grid is middleware to enable easy use of TeraGrid for genome analysis tasks. Don is looking for genomes that need compute intensive analysis. He is also interested in applying BioMart and Ergatis to these problems.
Dave Clements introduced himself and the goals of the GMOD Help Desk position.
Dave will make the help desk more visible on the web site, and add a GMOD News column to the home page.
Presentation: Recent Developments in Pathway Tools
Suzanne Paley talked about recent developments in Pathway Tools, including:
Presentation: Modeling and Displaying Synteny w/ SynView, Steve Fischer
Steve Fischer of ApiDB (see below) spoke about SynView. SynView is a synteny browser based on GBrowse. It is described in a Bioinformatics paper.
His talk raised a number of issues that have come up with recent extensions to SynView.
This is a MediaWiki extension by Jim Hu. It does two things. First, it makes it easier to update tables in MediaWiki, by presenting a nicer interface for altering wiki tables. Secondly, it supports synchronizing MediaWiki tables from database tables and vice versa.
These are all web interface layers that sit on top of Chado databases.
GMODWeb is currently not working, we think because SQLTranslator has not been upgraded to deal with recent versions of Postgres. Ben Faga agreed to actively work on this.
Michael Caudy argued that even if GMODWeb did work right now, it is not extensible enough to support complex queries and presentation. Mike presented Drupal, Drupal Views, and PHPTemplate as an alternative web framework for providing a web interface to Chado databases. Mike demonstrated a prototype called DrupalFly that presents FlyBase data in an alternative organization.
Lincoln has an opening in Toronto for a full time programmer. Lincoln will talk with Brian about GMODWeb’s future. We will put something on the web site asking for volunteers to take on GMODWeb.
A number of organizations talked about their recent work.
Presentation: ApiDB GBrowse update, Haiming Wang
Steve Fischer talked about ApiDB. ApiDB uses GUS as their schema. They do multispecies comparative analysis. They have a database adapter link from GBrowse to GUS. It is based on the Chado adapter. They use materialized views in Oracle 10G and it is still relatively slow.
See SynView above for details on SynView.
Syntenic maps at ApiDB are produced with Mercator. The maps are based on gene orthology. Gene orthologs are generated using OrthoMCL. All alignments are pairwise, rather than multiple. Orthology is represented outside standard GUS schema. In the synteny schema, everything is defined relative to the reference sequence. Also need a table to define anchors.
Steve Fischer showed an 11 track page, which has about 5000 popups in it.
ApiDB has a release cycle. They discard and recalculate synteny with every new release.
The Berkeley group is actively involved in supporting and developing Chado, GO, SO, OBO-Edit, Phenote, Apollo, and the new AJAX GBrowse.
FlyBase has migrated their production databases to the Chado database schema. FlyBase uses:
Victor Strelets talked about OrthoView, an extension to GBrowse for viewing synteny.
Victor also presented the genetic interactions viewer, a fast way of visualizing gene interactions. It does not run directly off of the Chado database.
Presentation: Community Annotation, Chinmay Patel
Chinmay Patel spoke about a week-long annotation project at Sanger involving 40 people all annotating the same genome.
They used the Artemis annotation editor (instead of Apollo), but Artemis was talking to a Chado database using an Artemis-Chado Ibatis-based (instead of Hibernate-based) adapter. The adapter is not yet released. (But it is now: see Artemis-Chado Integration Tutorial.)
Using GMOD to support a fungal sequencing project. Using:
Using Chado as database schema.
Taner Sen from MaizeGDB was at the meeting. Maize has multiple groups generating different gene models. It would be nice to display each group in a separate track. MaizeGDB is evaluating genome browsers and is considering using GBrowse.
Presentation: Community Annotation, Linda Sperling
Use GMOD for almost everything:
Paramecium is an odd critter (unicellular eukaryote, ciliate clade):
Fewer than 20 Paramecium molecular biology labs in the world. Database supported with 1.5 staff.
It is important that people be able to click on a link, launch Apollo, add some curation and save it. Their Apollo talks directly to Chado (no triggers). See Community Annotation above for more.
Riken uses GBrowse.
Use Chado as a backend, a lot. Use Sybil for comparative genomics, and are a mix of PostgreSQL and Oracle.
Presentation: Keynote, Powerpoint, PDF, Mov, Todd Harris
WormBase is migrating to Chado slowly. There is currently very little Chado there.
Presentation: GBrowse_syn, Sheldon McKay
Sheldon McKay talked about GBrowse_syn, a prototype extension to GBrowse for viewing synteny. Goal is to have a sequence alignment viewer that can look at more than two species at a time. GBrowse_syn is based purely on sequence alignments. It does not know about genes or orthologs per se.
Used PECAN for the alignments. Maps are precomputed in a very CPU-intensive step.
Chado may or may not support multiple alignments.