Difference between revisions of "June 2007 Progress Report"


Revision as of 19:48, 13 June 2007

Overview

Scott's text here

items to touch on:

  • meetings (past and future)
  • wiki
  • software releases
  • users

Chado/FlyBase

All FlyBase data is now being managed in chado. The vast majority of curated and bulk data is processed into chadoXML using WriteChadoMac.pm, and loaded using XORT. XORT is also used for all data dumping for generating public web pages as well as data for curation support.

  1. Chado schema work:
    • Modifications of chado schema to improve support of genetic/phenotypic data.
    • Continued work on integration of genetic/phenotypic data with genome data in chado.
    • Published paper (Bioinformatics - In Press):

      Title: A Chado Case Study: An Ontology-based Modular Schema for Representing Genome-Associated Biological Information
      Author(s): Christopher J Mungall, David B Emmert and The FlyBase Consortium
      Suzanna Lewis will present this paper at ISMB/ECCB, July 21-25, 2007, Vienna, Austria.
  2. Perl module for generating chadoXML XML::DOM elements:
    WriteChadoMac.pm
    This module provides methods for writing chadoXML elements for most types of chado data, which can then be loaded using XORT.
  3. XORT:
    • Performance improvements:
      • Redesign of dumping methods to handle large data output and significantly reduce dump time.
      • Implementation of entity declarations to support non-ASCII characters in chadoXML.
      • Add functionality to support "limit", "distinct on", "group by", and "order by" SQL query operations.
    • Bug fixes:
      • Fix bugs in the regular expressions used to parse dumpspec files.
      • Fix warning messages.
    • Development of XORT dumpspecs for dumping various aspects of FlyBase data
  4. Development of a suite of XSL templates that translate chadoXML into HTML reports, GFF and FASTA files, and the FlyBase search-optimized database.
  5. Continued work on improving the FlyBase search-optimized database and investigation of implementing BioMart.
  6. GBrowse:
    • Further development of PureGFF adaptor.
    • The plan is to submit this to GMOD later this summer, after further testing by external testers to ensure that it works outside of the FlyBase environment.
  7. Active chado databases at FlyBase:
    Currently in chado
    • D. melanogaster genome, annotations, genetic & phenotypic data.
    • D. pseudoobscura genome and annotations
    Additional genomes in the process of being implemented in chado at FlyBase
    • D. simulans genome and annotations
    • D. sechellia genome and annotations
    • D. yakuba genome and annotations
    • D. erecta genome and annotations
    • D. ananassae genome and annotations
    • D. persimilis genome and annotations
    • D. willistoni genome and annotations
    • D. mojavensis genome and annotations
    • D. virilis genome and annotations
    • D. grimshawi genome and annotations
  8. In development:
    CHIA - Chado Interface Application
    A Java user interface integrating XORT dumper and loader functionality, execution of pre-defined and ad hoc chado SQL queries, and simple reports of chado data.
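
The use of entity declarations for non-ASCII characters mentioned under XORT above can be illustrated with a minimal Python sketch. This is not XORT or WriteChadoMac.pm code: the entity table, the function names, and the simplified chado/feature/name layout are assumptions made for illustration only. Characters with a declared entity are written as named references, any other non-ASCII character falls back to a numeric character reference, and the resulting file stays pure ASCII.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical entity table (character -> entity name); not the FlyBase list.
ENTITIES = {"\u03b1": "agr", "\u03b2": "bgr", "\u00e9": "eacute"}


def encode_text(text):
    """Return ASCII-only XML text for the given string."""
    out = escape(text)  # escape &, < and > first
    for ch, name in ENTITIES.items():
        out = out.replace(ch, f"&{name};")  # named entity reference
    # Anything still non-ASCII becomes a numeric character reference.
    return out.encode("ascii", "xmlcharrefreplace").decode("ascii")


def doctype(root_name):
    """Build a DOCTYPE whose internal subset declares every entity."""
    decls = "".join(
        f'<!ENTITY {name} "&#{ord(ch)};">' for ch, name in ENTITIES.items()
    )
    return f"<!DOCTYPE {root_name} [{decls}]>"


def build_document(symbol):
    """Wrap one feature name in a simplified, assumed chadoXML layout."""
    body = f"<chado><feature><name>{encode_text(symbol)}</name></feature></chado>"
    return doctype("chado") + body


# Illustrative gene symbol containing a Greek letter.
doc = build_document("\u03b1Tub84B")
parsed = ET.fromstring(doc)  # the parser expands the declared entities
print(doc)
print(parsed.find("feature/name").text)
```

Declaring the entities in the document itself means that any standard XML parser can read the file back without a separate DTD.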

GBrowse/WormBase

Lincoln's text here

Textpresso

a text mining system for scientific literature

Report submitted by Hans-Michael Muller.

The new version, Textpresso 2.0, is now the official version used for running the website. Its functionality has been expanded in many ways; we have introduced a filter function for the initial search result, which can be used to tailor the output further, such as restricting the results to certain years of publication. Search results can now be requested in XML format, which can be used for further computational processing by the user. The new system has been packaged and documented, so users can download the software and install it on their own machines. It will be released in mid-June 2007.
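
As a rough illustration of the kind of downstream processing that XML output makes possible, the following Python sketch restricts a result set to a range of publication years. The element names (results, paper, year, title) and the sample records are hypothetical; the actual Textpresso XML layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical result document; the real Textpresso XML layout may differ.
SAMPLE = """<results>
  <paper id="p1"><year>2001</year><title>First example paper</title></paper>
  <paper id="p2"><year>2005</year><title>Second example paper</title></paper>
  <paper id="p3"><year>2006</year><title>Third example paper</title></paper>
</results>"""


def papers_in_years(xml_text, first_year, last_year):
    """Return (id, year, title) for papers published in the given range."""
    root = ET.fromstring(xml_text)
    selected = []
    for paper in root.iter("paper"):
        year = int(paper.findtext("year"))
        if first_year <= year <= last_year:
            selected.append((paper.get("id"), year, paper.findtext("title")))
    return selected


recent = papers_in_years(SAMPLE, 2005, 2007)
for paper_id, year, title in recent:
    print(paper_id, year, title)
```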

We are now running Textpresso sites for three different literatures and have expanded the corpora for each site. As of June 2007, the C. elegans site contains 8700 full text papers and 23000 abstracts, and the D. melanogaster site contains 27500 full text papers and 54500 abstracts. Our neuroscience site has 15800 full text papers and 17300 abstracts. We have improved our full text and bibliography acquisition routines to automate as many steps as possible, and we will periodically update the corpora to include more papers.

As part of implementing the D. melanogaster site, we were faced with the problem of word sense disambiguation. Textpresso marks up biological entities such as gene names so they can subsequently be searched for in the full text to make searches more specific. Many D. melanogaster gene names are shared with common English words such as 'a', 'for', 'wingless', and 'we'. We therefore developed a machine learning algorithm to disambiguate the meaning of a word or phrase. Even though we achieve a disambiguation accuracy of 90%, the overwhelming frequency of words such as prepositions and pronouns requires the inclusion of further information such as font type to identify gene names, as many gene names are written in italics in the literature. This has improved the identification significantly.
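
To illustrate how font information can tip a context-based decision, here is a toy sketch in Python. It is not the Textpresso algorithm, which is not described in this report; a small Naive Bayes classifier with add-one smoothing is used as a stand-in, and the training examples are invented for illustration. Each occurrence of an ambiguous token such as 'for' is described by its neighbouring words plus a flag saying whether it was printed in italics.

```python
import math
from collections import Counter, defaultdict


def features(context, italic):
    """Context words plus one font feature for the ambiguous token."""
    feats = [f"ctx={w.lower()}" for w in context]
    feats.append(f"italic={italic}")
    return feats


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for context, italic, label in examples:
            self.label_counts[label] += 1
            for f in features(context, italic):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, context, italic):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in features(context, italic):
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: words surrounding the token 'for', italic flag, label.
TRAINING = [
    (["mutants", "of", "show", "defects"], True, "gene"),
    (["expression", "of", "in", "larvae"], True, "gene"),
    (["allele", "of", "was", "isolated"], True, "gene"),
    (["used", "this", "assay"], False, "word"),
    (["searched", "homologs", "in"], False, "word"),
    (["required", "normal", "development"], False, "word"),
]

clf = NaiveBayes()
clf.train(TRAINING)

# The same uninformative context is resolved differently by the font flag.
print(clf.predict(["in"], True))
print(clf.predict(["in"], False))
```

In this toy setting the context word 'in' occurs with both senses, so the italic flag alone decides the outcome, which mirrors the role the report gives to font type.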

CMap/Gramene

Ben's text here

Pathway Tools/BioCyc

Report submitted by Peter D. Karp

Please note that the full history of updates to Pathway Tools can be found at URL http://bioinformatics.ai.sri.com/ptools/release-notes.html.

Significant updates funded under this grant since the last report in August 2006 are as follows.

  • Versions 10.5 and 11.0 of Pathway Tools have been released in this period.
  • 1177 groups have licensed Pathway Tools to date.

During this grant period we made several significant bioinformatics advances. We developed a novel method of displaying, interrogating, and superimposing omics data on the full transcriptional regulatory network of an organism. We developed a novel method of viewing omics data in the context of the full genome map of an organism. We developed a graphical tool for interactively tracing metabolites through the metabolic network of an organism. And we developed a completely new database query language, and an associated graphical interface that allows biologists to intuitively compose database queries, which are automatically translated by the system into that database query language.

Significant software enhancements funded by the Pathway Tools grant during this period include the following.

  • New Genome Overview. This tool provides a one-screen view of every gene on one or more chromosomes and plasmids, and can display omics data across those entire replicons. The current version works in desktop mode only; in the next release of Pathway Tools, the Genome Overview will work through the Web as well. To see the Genome Overview, go to http://biocyc.org/desktop-vs-web-mode.shtml#genome-ov.
  • New Regulatory Overview. This tool displays the transcriptional regulatory network of an organism that is defined in a PGDB. The network can be interrogated in several ways, such as highlighting all genes under a specified Gene Ontology class, and highlighting all genes regulated by a specified transcription factor. The current version works in desktop mode only; in the next release of Pathway Tools, the Regulatory Overview will work through the Web as well. To see the Regulatory Overview, go to http://biocyc.org/desktop-vs-web-mode.shtml#reg-ov.

    The Regulatory Overview depends for its operation on an encoding of the organism's transcriptional regulatory network within a PGDB. Currently, EcoCyc is the only BioCyc PGDB that contains such a regulatory network. PGDB authors can define such a network manually using the interactive editors within Pathway Tools.

  • Metabolite tracing. A new metabolite tracing tool allows users to visually trace the path of substrates through the metabolic network within a PGDB, using the Cellular Overview diagram. To see an example of metabolite tracing, go to http://biocyc.org/desktop-vs-web-mode.shtml#metab-trace.
  • New BioVelo Query Language. We introduce a new advanced query language for querying PGDBs called BioVelo. BioVelo is a query-by-example system that allows users to construct extremely powerful queries using an intuitive graphical interface. BioVelo replaces the old Advanced Query Page. Users can construct BioVelo queries interactively through the BioCyc Advanced Query Page (documentation: http://biocyc.org/webQueryDoc.html), and they can construct textual queries using the BioVelo language (documentation).
  • Gene ontology assignments (both GO and MultiFun) are now displayed on gene-product pages in addition to gene pages.
  • New commands Proteins->Search by GO Term and Proteins->Search by MultiFun Term are available.
  • External databases. The editors now contain a command for creating or editing the descriptions of external databases for use in PGDB links to those databases.
  • The Pathway Hole Filler is now fully functional under the Windows operating system.
  • Monitor sizing. Through both the desktop and Web versions, Pathway Tools now knows the size of the user's monitor. For example, this allows users to create very large genome browser displays by reshaping their Web browser to the full screen of a wide-screen monitor.
  • Automatic patch loading. Whenever Pathway Tools starts up, it now performs its Instant Patch command automatically, so that users will always be running the latest set of patches.

The following additional enhancements to Pathway Tools were funded by other projects, but are available to all Pathway Tools users.

  • Display of protein features on protein pages has been improved.
  • Google searching. The Navigator query page now contains a section for performing a Google-based search of the PGDB, which uses Google's index of a PGDB to perform arbitrary text searches against the PGDB.
  • New All-Search box. An All-Search box is now present at the bottom of every PGDB web page to allow users to perform a new search without first clicking to the query page.
  • Name mouse-overs. Mouseover of compounds, genes, and proteins will additionally show all object synonyms.
  • Compound duplicate checking: The Compound Editor now checks if a newly created chemical compound is a duplicate of an existing compound in either the current PGDB or in MetaCyc, by searching both the names and chemical structure of the new compound.

Apollo

Report submitted by Suzanna Lewis.

Specific Aims

The primary aim of the Apollo group over the past year was twofold: to sustain our existing users by voluntarily contributing to documentation and user support, and to secure funding to buttress this minimal support and extend the capabilities of Apollo.

Studies and Results

We have assisted and interacted with a number of different groups over the past year. One of the most interesting examples is the INRA Unite de Biometrie et Intelligence Artificielle (Marie-Josie Cros). This group is interested in the identification of non-protein-coding RNA (ncRNA) in bacteria and archaea. They were looking for an open source solution that they could extend for annotating such things as transcription terminators, repeated regions, conserved regions, predicted ncRNA, and so forth. They used the QSOS method 1.5 (http://www.qsos.org) for this evaluation, and based on the metrics of industrialization (documentation, quality method, installation, ease of use), adaptability (modularity, modification and extension of code), and the data input/output formats allowed, Apollo was chosen over other commonly used genomic browsing software. We are planning to work with them to incorporate into the main code base the following extensions that they have developed:

  • Prediction and visualization of the secondary structure of a subsequence (item of a new menu RNA)
  • Prediction and visualization of RNA/RNA interactions (item of a new menu RNA)
  • Visualization (graph) of quantitative variables
  • Export of chained views

Apollo is being used for teaching purposes at the Dolan Learning Center in Cold Spring Harbor, Washington University, and most recently at the University of San Francisco. The number of user groups continues to expand as new genomes are finished. For example, Apollo was used for the annotation of the honeybee genome.

We are also pleased to report that we have succeeded in our second goal. Through the National Institute of General Medical Sciences, grant R01 GM080203-01, we will begin new work on July 1, 2007.

Significance

The highest-quality annotation is obtained by combining automated sequence analysis results with the expert knowledge of biologists. Apollo is a cross-platform annotation editing tool that streamlines this process by providing an interactive graphical display that allows biologists to view many different computational analyses of a genomic region and use them, together with their knowledge of direct experimental results, to create and refine detailed annotations.

Plans

Our new work is focused on the following specific aims:

  1. We will enable Apollo to annotate a wider range of sequence features by using the Sequence Ontology. This will also improve Apollo's interoperability with other biological data sources.
  2. We will implement a configuration interface to make it easier for researchers to set preferences and display new data sources.
  3. We will develop additional editors: one that will allow gene models to be modified in direct reference to multiple alignment data, and another to edit repetitive elements in detail.
  4. We will improve the analysis import code, documentation, and user interface to increase its ease of use, enabling biologists to more easily add on-demand analyses of their sequence of interest to the data being displayed.
  5. We will continue our Apollo support and outreach efforts, including workshops, on-site visits, and training curricula for both biologists and software developers.

The work will be done in collaboration with The Arabidopsis Information Resource (TAIR) at the Carnegie Institution.

Publications

Mungall CJ, Emmert DB, Lewis SE. (2007) A Chado Case Study: An Ontology-based Modular Schema for Representing Genome-Associated Biological Information. Bioinformatics (in press).

Project Generated Resources

This work has generated software and associated documentation that are freely available to other investigators. These can be accessed from the GMOD web site at http://www.gmod.org/wiki/index.php/Apollo.

RGD

TIGR

MODWare/DictyBase

wFleaBase/Indiana University

GMODWeb/Turnkey

TAIR

During the past year, TAIR has begun to expand its use of GMOD tools:

We used Apollo extensively as a curation tool in preparing our recent Arabidopsis genome release, TAIR7. We will continue to work with Apollo for our next genome release and will also participate in a newly funded project to develop a new transcript editor and other enhancements in collaboration with Suzi Lewis.

In addition, we are in the process of deploying GBrowse on the TAIR site.

MGI

SGD