Progress report submitted by Scott Cain, GMOD Coordinator.
In the nine months since the last progress report (see the 2006 progress report at http://blog.gmod.org/files/GMOD_2006_update.doc), the GMOD project has shown significant progress in several areas. There have been meetings and software releases, and the GMOD website was revamped to make it more useful and intuitive to users.
Since the last progress report, there was a meeting held in conjunction with the Plant and Animal Genome meeting in San Diego in January of
Two meetings are planned for the remainder of the year. The first, to be held August 23 and 24 at DictyBase at Northwestern University in Chicago, will be a GMOD ‘Hackathon’, where a small group of developers will gather to address pressing development needs for core GMOD software functionality. A larger GMOD meeting will be held following the Genome Informatics meeting at Cold Spring Harbor Laboratory on November 5-7.
Based on a documentation requirements analysis conducted by Brian Osborne within the context of the GMOD documentation and helpdesk initiative spearheaded by NESCent (http://www.nescent.org/), it became apparent at the GMOD meeting in January that the GMOD homepage and the on-line documentation needed to be revamped to make them more accessible to the user community. As one result, the GMOD website was moved from a Drupal-based content management system to a wiki format (based on the open source MediaWiki software), which is much more familiar to many developers and allows for better and easier structuring of the documentation. Led by Brian Osborne and NESCent, considerable effort was put into collecting, organizing, and where necessary, creating documentation for various GMOD components. The old GMOD website remains as http://blog.gmod.org/, where it is used as a weekly project update tool for several of the developers working on GMOD projects.
Several GMOD software packages have been released since the last progress report; the details of some of them are outlined in the sections below. The packages include:
Also of interest in this area is the preparation for the next release of the core GMOD software, comprising an updated Chado schema and tools for interacting with the database. The scheduled release is timed to correspond to the August Hackathon at DictyBase.
While it is difficult to count accurately how many users an open source project has, GMOD has, as far as we can tell, been quite successful in attracting users for its software. There are approximately 15-20 known users of Chado, including new users at BeetleBase, SmedDB (a planaria database), Sol Genomics Network (a Solanaceae genome database) and VectorBase (a database for Invertebrate Vectors of Human Pathogens). A list of known GMOD users is at http://gmod.org/wiki/index.php/GMOD_Users.
Several GMOD projects submitted project reports to be included with this report; they follow below.
All FlyBase data is now being managed in chado. The vast majority of curated and bulk data is processed into chadoXML using WriteChadoMac.pm, and loaded using XORT. XORT is also used for all data dumping for generating public web pages as well as data for curation support.
Published paper (Bioinformatics - In Press):
Title: A Chado Case Study: An Ontology-based Modular Schema for
Representing Genome-Associated Biological Information
Author(s): Christopher J Mungall, David B Emmert and The FlyBase
Consortium
Suzanna Lewis will present this paper at ISMB/ECCB, July 21-25,
2007, Vienna, Austria.
Active chado databases at FlyBase: Currently in chado
Additional genomes in the process of being implemented in chado at FlyBase
Report submitted by Lincoln Stein
Over the past year we have begun major rearchitecture work on GBrowse to make it faster and more scalable. The chief rearchitecture step has been to make it possible for individual GBrowse tracks to be rendered in parallel on a compute cluster. This will avoid the current penalty of rendering slowing down as additional tracks are added. This version is currently under active development and is not sufficiently stable for production use.
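The idea of rendering tracks independently can be illustrated with a minimal sketch. This is not GBrowse code: the function names and the text "image" format are hypothetical, and a thread pool on a single machine stands in for the nodes of a compute cluster. It only shows why independent tracks can be drawn concurrently and reassembled in their original order.

```python
from concurrent.futures import ThreadPoolExecutor


def render_track(track_name, features):
    """Hypothetical stand-in for drawing one track; returns a text 'image'."""
    spans = " ".join(f"[{start}-{end}]" for start, end in features)
    return f"{track_name}: {spans}"


def render_all(tracks, max_workers=4):
    """Render every track concurrently and return the results in track order."""
    names = list(tracks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if workers finish out of order.
        rendered = list(pool.map(lambda name: render_track(name, tracks[name]), names))
    return dict(zip(names, rendered))


# Illustrative data only: two tracks with (start, end) feature coordinates.
tracks = {"genes": [(100, 900), (1500, 2400)], "ESTs": [(120, 480)]}
panels = render_all(tracks)
```

Because each track is rendered from its own features only, adding a track adds work for one more worker rather than lengthening a single sequential pass.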
Other GBrowse enhancements are more incremental and are available on the production branch. These include “balloon tips”, a “lensing” interface that allows additional detail to be added to a displayed feature without cluttering the screen. Balloon tips can also be used to attach HTML menus and query forms to a feature. We’ve also created glyphs specialized for displaying high-density data, such as DNA tiling array data.
a text mining system for scientific literature
Report submitted by Hans-Michael Muller.
The new version, Textpresso 2.0, is now the official version used for running the website. Its functionality has been expanded in many ways; we have introduced a filter function for the initial search result, which can be used to tailor the output further, such as restricting the results to certain years of publication. Search results can now be requested in XML format, which can be used for further computational processing by the user. The new system has been packaged and documented, so users can download the software and install it on their own machines. It will be released in mid-June 2007.
We are now running Textpresso sites for three different literatures and have expanded the corpora for each site. As of June 2007, the C. elegans site has 8700 full text papers and 23000 abstracts, and the D. melanogaster site has 27500 full text papers and 54500 abstracts. Our neuroscience site has 15800 full text papers and 17300 abstracts. We have improved our full text and bibliography acquisition routines to automate as many steps as possible, and we will periodically update the corpora to include more papers.
As part of implementing the D. melanogaster site, we were faced with the problem of word sense disambiguation. Textpresso marks up biological entities such as gene names so they can subsequently be searched for in the full text to make searches more specific. Many D. melanogaster gene names are shared with common English words such as ‘a’, ‘for’, ‘wingless’, and ‘we’. We therefore developed a machine learning algorithm to disambiguate the meaning of a word or phrase. Even though we achieve a disambiguation accuracy of 90%, the overwhelming frequency of words such as prepositions and pronouns requires the inclusion of further information such as font type to identify gene names, as many gene names are written in italics in the literature. This has improved the identification significantly.
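A short worked calculation shows why 90% accuracy alone is not enough when ordinary uses of a word vastly outnumber gene mentions. The counts below are hypothetical, not Textpresso measurements, and the sketch assumes the 90% figure applies equally to gene mentions and to ordinary uses of the word.

```python
def precision(gene_mentions, other_uses, accuracy):
    """Fraction of tokens labeled 'gene' that really are gene mentions.

    Assumes the classifier is right with probability `accuracy` on both
    gene mentions and ordinary uses of the word (an illustrative assumption).
    """
    true_positives = gene_mentions * accuracy
    false_positives = other_uses * (1 - accuracy)
    return true_positives / (true_positives + false_positives)


# Hypothetical counts: 'for' occurs 10 times as a gene name
# and 10,000 times as a preposition.
all_tokens = precision(10, 10_000, 0.90)   # about 0.009

# If only italicized tokens are considered, most prepositions drop out.
# Hypothetical counts: all 10 gene mentions and 50 other uses are in italics.
italic_only = precision(10, 50, 0.90)      # about 0.64
```

Under these assumed counts, fewer than 1 in 100 tokens labeled as a gene would be correct without the font cue, while restricting attention to italicized tokens raises that to roughly 2 in 3.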
Report submitted by Ben Faga.
We continue to optimize CMap for speed and usability. A major new feature over the past year has been the ability to stack maps on top of each other. This is particularly suitable for comparing a large-scale map, such as a genetic map, with many smaller-scale maps, such as BAC contigs. Another important feature is the interface for adding comparative maps. We now use AJAX technology to dynamically update the page, giving the user a more intuitive and interactive interface to this key function. A new view has been added to display correspondences in dot plot format. The installation process was modified to better identify directories where CMap components should be installed. The installation script can now set up a demo CMap data source which will allow a new user to see CMap with data immediately after installation. A network install script was created (modified from the GBrowse network install script). This script will download the CMap distribution and any prerequisites not found on the machine. When that is done, it installs CMap. Other minor enhancements have been made.
Work on the CMap Assembly Editor has continued. The CMap Assembly Editor (CMAE) is a desktop application being developed to assist in visualizing and editing large scale sequence assemblies for the maize sequencing project. CMAE will display sequence assemblies together with diverse mapping data in a tiered manner, giving a finisher a fuller context when making decisions. CMAE allows the finisher to move, merge and break maps. These changes can then be saved to the CMap database and exported to an external script which can modify the source data. CMAE can access data on the local machine or remotely using a web server running a specially configured CMap. CMAE can also interpret an XML document listing specific maps to view, which lets a script designed to look for problem sections create specific views for a finisher to examine.
Report submitted by Peter D. Karp
Please note that the full history of updates to Pathway Tools can be found at URL http://bioinformatics.ai.sri.com/ptools/release-notes.html.
Significant updates funded under this grant since the last report in August 2006 are as follows.
During this grant period we made several extremely significant bioinformatics advances. We developed a novel method of displaying, interrogating, and superimposing omics data on the full transcriptional regulatory network of an organism. We developed a novel method of viewing omics data in the context of the full genome map of an organism. We developed a graphical tool for interactively tracing metabolites through the metabolic network of an organism. And we developed a completely new database query language, and an associated graphical interface that allows biologists to intuitively compose database queries, which are automatically translated by the system into that database query language.
Significant software enhancements funded by the Pathway Tools grant during this period include the following.
New Regulatory Overview. This tool displays the transcriptional regulatory network of an organism that is defined in a PGDB. The network can be interrogated in several ways, such as highlighting all genes under a specified Gene Ontology class, and highlighting all genes regulated by a specified transcription factor. The current version works in desktop mode only; in the next release of Pathway Tools, the Regulatory Overview will work through the Web as well. To see the Regulatory Overview, go to http://biocyc.org/desktop-vs-web-mode.shtml#reg-ov.
The Regulatory Overview depends for its operation on an encoding of the organism’s transcriptional regulatory network within a PGDB. Currently, EcoCyc is the only BioCyc PGDB that contains such a regulatory network. PGDB authors can define such a network manually using the interactive editors within Pathway Tools.
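The kind of query described above, highlighting all genes regulated by a specified transcription factor, can be sketched on a plain dictionary. This is not the Pathway Tools API or the PGDB encoding: the network, the gene names, and both function names are hypothetical placeholders chosen for illustration.

```python
from collections import deque


def direct_targets(network, factor):
    """Genes directly regulated by `factor` in a regulator -> targets mapping."""
    return set(network.get(factor, []))


def downstream_genes(network, factor):
    """All genes reachable from `factor`, following regulators of regulators."""
    seen = set()
    queue = deque(network.get(factor, []))
    while queue:
        gene = queue.popleft()
        if gene in seen:
            continue
        seen.add(gene)
        # A target that is itself a regulator contributes its own targets.
        queue.extend(network.get(gene, []))
    return seen


# Hypothetical network: tfA regulates gene1 and another regulator, tfB.
network = {
    "tfA": ["gene1", "tfB"],
    "tfB": ["gene2", "gene3"],
    "tfC": ["gene3"],
}
highlighted = direct_targets(network, "tfA")
cascade = downstream_genes(network, "tfA")
```

The first function corresponds to highlighting a factor's immediate targets; the second follows regulatory cascades, which a display tool could offer as a separate option.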
The following additional enhancements to Pathway Tools were funded by other projects, but are available to all Pathway Tools users.
Report submitted by Suzanna Lewis.
The primary aim of the Apollo group over the past year was two-fold: To sustain our existing users by voluntarily contributing to documentation and user support; and to secure funding to buttress this minimal support and extend the capabilities of Apollo.
We have assisted and interacted with a number of different groups over the past year. One of the most interesting examples is the INRA, Unite de Biometrie et Intelligence Artificielle (Marie-Josie Cros). This group is interested in non-protein-coding RNA (ncRNA) identification in bacteria and archaea. They were looking for an open source solution that they could extend for annotating such things as the terminator of transcription, repeated regions, conserved regions, predicted ncRNA, and so forth. They used the QSOS method 1.5 (http://www.qsos.org) for this evaluation, and based on the metrics of industrialization (documentation, quality method, installation, ease of use), adaptability (modularity, modification and extension of code) and data input/output formats allowed, Apollo was chosen over other commonly used genomic browsing software. We are planning on working with them to incorporate into the main code base the following extensions that they have developed:
Apollo is being used for teaching purposes at the Dolan Learning Center in Cold Spring Harbor, Washington University, and most recently at San Francisco State University. The number of user groups continues to expand as new genomes are finished. For example, Apollo was used for the annotation of the honeybee genome.
We are also pleased to report that we have succeeded in our second goal. Through the National Institute of General Medical Sciences, grant R01 GM080203-01, we will begin new work on July 1, 2007.
The highest-quality annotation is obtained by combining automated sequence analysis results with the expert knowledge of biologists. Apollo is a cross-platform annotation editing tool that streamlines this process by providing an interactive graphical display that allows biologists to view many different computational analyses of a genomic region and use them, together with their knowledge of direct experimental results, to create and refine detailed annotations.
Our new work is focused on the following specific aims:
The work will be done in collaboration with The Arabidopsis Information Resource (TAIR) at the Carnegie Institution.
This work has generated software and associated documentation that are freely available to other investigators. These can be accessed from the GMOD web site at http://www.gmod.org/wiki/index.php/Apollo.
Report submitted by Simon Twigger
RGD continues to use the GMOD GBrowse to provide genome browser functionality at RGD. We are using our GMOD Flash GViewer as part of our disease portals and will be replacing our older SVG-based GViewer use on our ontology report pages with the Flash GViewer in the coming year. Flash GViewer has been a popular download and is in use by a number of sites, for example:
We will be working on improving the rendering of larger datasets in GViewer, particularly on the zoomed view of a chromosome. The layout function works effectively for smaller numbers of features but becomes unwieldy at larger numbers of features. We would also like to increase the link out options to provide access to more than one external link for a feature or region.
Progress report, 2006/July - 2007/June, Don Gilbert
Significant effort during this 2006/2007 period was devoted to data updates and management efforts for wFleaBase (Daphnia), DroSpeGe (Drosophila) and Bionet news groups.
The Daphnia genome database (http://wFleaBase.org/) has been updated with several new genome data components for this emerging model organism’s public genome release in July 2007. These include gene predictions including NCBI’s Gnomon, JGI’s models, and GIL contributions, EST/cDNA assemblies with a gene validation assessment (using TIGR’s PASA pipeline), BioMart, GBrowse, and Blast updates.
The DroSpeGe comparative genome database of twelve Drosophila species (http://insects.eugenes.org/DroSpeGe/) has been updated with several annotation contributions from the genome informatics community, and efforts from GIL for various genome summaries, phylogenetic identification of thousands of new D. melanogaster genes, analysis of gene gain/loss in GO groups,
Bionet/BIOSCI news and discussion groups (http://www.bio.net/) provide world-wide public discussion in several areas of biosciences. These include several model organism communities (arabidopsis, yeast, celegans, drosophila, zebrafish, chlamydomonas).
Organization and introduction was prepared for the MOD Face caucus, Jan
GenBank to Chado worked example: This package provides updates for GMOD and BioPerl tools to simplify creating Chado genome databases from NCBI GenBank genomes. This includes contributions to the GMOD and BioPerl shared code base. GBrowse Chado Editor: This May 2007 addition is a simple way to add community annotations to a Chado database. See http://iubio.bio.indiana.edu/gmod/genbank2chado/
Software tools to fully assemble, analyze and compare these genomes are available, but the ability to employ them is limited to those with extensive computational resources and engineering talent. In this project, methods are being developed for use by existing and emerging model organism databases that will address genome database access needs and middleware for comparative analyses. Effective use of shared cyberinfrastructure, such as NSF-sponsored TeraGrid and other Grid systems, is a problem today for bioinformatics. The planned work in this area addresses these problems with data grid methods that partition large genome database sets for effective use of Grid systems. PRELIMINARY: http://gmod.cvs.sourceforge.net/gmod/genogrid/
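One simple way to partition a genome data set for parallel processing is to balance the total sequence length assigned to each worker. The sketch below is an illustration only, not the genogrid method: the function name, the sequence names, and the lengths are hypothetical, and it uses a greedy longest-first assignment.

```python
import heapq


def partition_by_length(seq_lengths, n_parts):
    """Assign sequences to n_parts groups of roughly equal total length.

    Greedy longest-first: each sequence goes to the currently lightest group.
    """
    heap = [(0, index) for index in range(n_parts)]  # (total length, group index)
    heapq.heapify(heap)
    parts = [[] for _ in range(n_parts)]
    for name, length in sorted(seq_lengths.items(), key=lambda item: -item[1]):
        total, index = heapq.heappop(heap)
        parts[index].append(name)
        heapq.heappush(heap, (total + length, index))
    return parts


# Hypothetical sequence lengths in arbitrary units.
lengths = {"chr1": 50, "chr2": 40, "chr3": 30, "chr4": 20, "chr5": 10}
groups = partition_by_length(lengths, 2)
totals = [sum(lengths[name] for name in group) for group in groups]
```

With these assumed lengths the two groups total 80 and 70 units, so neither worker is left with most of the data.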
During the past year, TAIR has begun to expand its use of GMOD tools:
We used Apollo extensively as a curation tool in preparing our recent Arabidopsis genome release, TAIR7. We will continue to work with Apollo for our next genome release and will also participate in a newly funded project to develop a new transcript editor and other enhancements in collaboration with Suzi Lewis.
In addition, we are in the process of deploying GBrowse on the TAIR site.