GMOD

April 2004 GMOD Meeting

Generic Model Organism Database Construction Set

Meeting 4

GMOD Meeting April, 2004

Presentations

Agenda

April 26, Morning: Combined Developers and Curators section

Mount Vernon Room

9:00am Introductions
Scott Cain (CSHL)
9:20
Don Gilbert (FlyBase, Indiana University)
10:30 Break
10:45
Frank Smutniak (FlyBase, Harvard University)
11:20
Stan Letovsky (FlyBase, Harvard University)
11:45 Lunch (on your own; there are many good restaurants, so check with a local)

April 26, Afternoon: Developer section

Mount Vernon Room

1:30
Arek Kasprzyk (EBI)
2:00 GMOD/Turnkey web demo
Brian O'Connor (UCLA)
2:30 Break
3:00
Eimear Kenny (WormBase, CalTech)
3:30
Toshiaki Katayama (Human Genome Center, University of Tokyo, Japan)
3:45 Break
4:00
David Emmert (FlyBase, Harvard University)
4:30
Scott Cain

April 26, Afternoon: Curator section

Terrace Room

1:30 Jennifer Wortman (The Institute for Genomic Research)
1:50 Shannon Schlueter (Arabidopsis thaliana Plant Genome Database, Iowa State University)
2:10 Aniko Sabo (Genome Sequencing Center, Washington University School of Medicine)
2:30 Break
2:50 Madeline Crosby (FlyBase, Harvard University)
3:10 Kim Worley (Human Genome Sequencing Center, Baylor College of Medicine)
3:30 Astrid Terry (Joint Genome Institute)
3:50 Break
4:10 Chinnappa Kodira (Broad Institute)
4:30 Michele Clamp (Broad Institute)
4:50 Break
5:00 Group discussion
6:00 Dinner (on your own; see above)

April 27, Developer section

Mount Vernon Room

   
9:00 GMOD Alpha release, Part II

The goal here is to install the gmod alpha release on attendees' computers to test the installation and work through any issues with the release. There will almost certainly also be time for “breakout” sessions in which smaller groups can discuss a variety of topics. Suggestions will be accepted both before and during the meeting.

April 27, Curator section

Terrace Room

9:00 Apollo Demo
Sima Misra (FlyBase & Berkeley Drosophila Genome Center)
9:40
Nomi Harris (FlyBase & Berkeley Drosophila Genome Center)
10:00 Hands-on Apollo workshop for curators
Breakout session for Apollo developers
12:00 Lunch
1:30 Q&A session with curators & Apollo developers
2:30 Hands-on Apollo workshop for curators
Breakout session for Apollo developers
4:30 Q&A session with curators & Apollo developers
Group discussion

April 27, Dinner

6:00
Reservation on us, food paid for by you

April 28, Combined Developer and Curator section

Mount Vernon Room

9:00 Updates from the previous day
Scott Cain and Sima Misra
10:00 Break
10:15
Bill Gelbart (FlyBase, Harvard University)
11:15 Break
11:30 Closing remarks, planning for the next meeting
Scott Cain

Progress reports

GMOD Project Progress Reports
April, 2004
-----------------------------

The past four months have seen the first two releases of gmod, which will
become the suite of model organism database software.  The first release,
version 0.001 (alpha), was released in January, 2004.  The main goal of that
release was to establish a release procedure.  The release consisted of
chado, the database schema developed primarily by FlyBase developers at
Harvard and BDGP.  Additionally, there were a variety of tools for
installing and loading data into the database, developed primarily by
Allen Day at UCLA and Scott Cain at CSHL.  Finally, there was a compatible
version of the Generic Genome Browser with a chado database adaptor
developed to allow browsing of genome features directly from the database.

The second release, also an alpha release, consisted of the same components,
and was released in March, 2004. In this release, the installation procedure
improved considerably, and a prerequisite that had caused testers difficulties
was removed.  During the GMOD meeting in April, this release was installed
by several attendees during a workshop.  Several suggestions were made that
will be implemented in the next release.

There are several items planned for addition or improvement in the next
two releases.  Tools to allow importing and exporting XML formatted data
from chado will be included, which will allow the sequence annotation tool,
Apollo, to be used with chado. Additionally, a template-based web front end
for chado called turnkey will be included in an upcoming release.  This software is
still early in the development process, but when it was presented to
developers at the GMOD meeting in April, there was considerable interest
in getting it included in a gmod release as soon as possible.

Longer-term goals for gmod releases include adding pubsearch and pubfetch.
The process of porting these applications has begun and is expected to be
complete by the end of the year.  A tool for literature-based sequence
annotation, called JavaSEAN, is expected to be included in gmod in a similar
time frame.  Additionally, there are plans from the Apollo developers to
create a new version of Apollo that will be able to read and write directly
to the database without using an XML intermediary, which will simplify the
process of sequence annotation considerably.



Apollo Progress Report (11/2003 - 4/2004)

Major improvements in release 1.3.6 (11/3/03):

Apollo now runs under JDK1.4, which works better on most platforms.

Can rubberband a region on the axis and the selected sequence will pop up
in a Sequence window.

Results that represent hits against sequences that are new to their
respective database (as indicated in tiers file) are shown with a box
around them, so that the curator can immediately see which results are
new and need to be looked at.

Search (Find) now allows full regexps.

Instead of having the config files in $HOME/.apollo be slightly modified
copies of the ones in APOLLO_ROOT/conf, you can now put ONLY the stuff
you want changed into your personal cfg files.  Apollo will first read
the ones in APOLLO_ROOT/conf, and then read your personal cfgs and apply
any modifications.
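
The override order described above can be sketched in a few lines of
Python. This is only an illustration of "defaults first, personal changes
on top", not Apollo's actual code; the option names and the simple
"key value" line format are hypothetical.

```python
def parse_cfg(text):
    """Parse hypothetical 'key value' lines, skipping blanks and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        options[key] = value.strip()
    return options


def load_layered(default_text, personal_text=""):
    """Read the site-wide defaults first, then apply personal overrides."""
    options = parse_cfg(default_text)
    options.update(parse_cfg(personal_text))
    return options


# Made-up stand-ins for a file in APOLLO_ROOT/conf and one in $HOME/.apollo.
site_defaults = "example_color blue\nexample_font_size 10\n"
personal = "example_color red\n"
print(load_layered(site_defaults, personal))
```

Only the one changed option needs to appear in the personal file; every
other option falls through to the site-wide default.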

Synteny (see Synteny section at end)


Major improvements in release 1.4.0 (internal release) (2/9/2004):

New game.tiers file format (easier to read and change).  If you have an
old game.tiers, it will be autoconverted to the new format.

Better handling of non-gene annotation types.  New glyphs for showing
them in main Apollo display.

New annotations are automatically assigned the type (e.g. gene, tRNA,
etc.) appropriate to the evidence that was used to create them.  (Type
can then be changed in the annotation info editor, if desired.)

Structured transaction records are now added to the XML when you save.
They include the type of object that changed (e.g. TRANSCRIPT;
ANNOTATION; COMMENT), the operation (e.g. ADD, SPLIT, etc.), the relevant
names and/or IDs before and after the transaction, and the user and
time/date when the change was made.
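
As a rough picture of what one such record carries, here is a minimal
Python sketch. The field names are hypothetical and do not reproduce the
actual element names in Apollo's XML; they simply mirror the pieces of
information listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass
class TransactionRecord:
    object_type: str          # e.g. "TRANSCRIPT", "ANNOTATION", "COMMENT"
    operation: str            # e.g. "ADD", "SPLIT"
    before: Tuple[str, ...]   # names and/or IDs before the change
    after: Tuple[str, ...]    # names and/or IDs after the change
    user: str                 # who made the change
    timestamp: datetime       # when the change was made


def record_change(log, object_type, operation, before, after, user, when=None):
    """Append one transaction record to the log and return it."""
    rec = TransactionRecord(object_type, operation, tuple(before),
                            tuple(after), user, when or datetime.now())
    log.append(rec)
    return rec


# Made-up example: one annotation split into two.
log = []
record_change(log, "ANNOTATION", "SPLIT", ["example-gene"],
              ["example-gene-a", "example-gene-b"], "curator1",
              datetime(2004, 2, 9, 12, 0))
```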

Support for translational exceptions, including frame shifts and one base
pair genomic sequencing errors.

UTRs are now shown in a different (configurable) color from the rest of
the gene.

Restriction enzyme mapper:
- Cut sites show up in main window (near the axis)
- Can now map multiple restriction enzymes at once
- Table of restriction fragments; can be selected for viewing in
Sequence window
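
The idea of mapping several enzymes at once and tabulating the resulting
fragments can be sketched as follows. This is not Apollo's implementation;
it is a minimal Python illustration that assumes a linear sequence and
uses the standard recognition sites of EcoRI (G^AATTC) and BamHI
(G^GATCC), both of which cut one base into the site.

```python
import re

# Enzyme name -> (recognition site, cut offset within the site).
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def cut_positions(seq, enzymes=ENZYMES):
    """Return {enzyme: [cut positions]} for every enzyme at once."""
    seq = seq.upper()
    cuts = {}
    for name, (site, offset) in enzymes.items():
        # A lookahead pattern reports overlapping sites as well.
        cuts[name] = [m.start() + offset
                      for m in re.finditer("(?=" + site + ")", seq)]
    return cuts


def fragments(seq, cuts):
    """Split the sequence at the combined cut positions of all enzymes."""
    points = sorted({p for positions in cuts.values() for p in positions})
    bounds = [0] + points + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


example = "AAGAATTCTTGGATCCAA"
sites = cut_positions(example)
print(sites)
print(fragments(example, sites))
```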

Annotation info window:
- Now has integrated annotation tree
- Shows arbitrary properties for annotations and transcripts (including
validation_flag)
- Shows translational exceptions and genomic sequencing errors
- Lets you edit annotation ID as well as name/symbol

Ability to tag results by selecting from a list of comments, which are
specified (as ResultTags) in game.style.  Tagged results are crosshatched
in pink in the display.

Fixed updating of peptide sequences.


Improvements in releases 1.4.1 (3/12/04) and 1.4.2 (3/18/04):

Red/green markers at axis show where sequence/region ends.

To help you identify splice sites that are unconventional, colored
triangles appear in the annotation glyph.

Can now load D. melanogaster data from r3.1 (gadfly) and r3.2 (chado)
(both via cgi).


Improvements in release 1.4.3 (4/19/04):

Lets users get the sequence of the entire segment being viewed, not
just a rubberbanded section.  [File -> Save sequence]



Synteny progress, 11/03-4/04:

- Synteny now works with GAME. You can load one species and then use the
blast or syntenic block results to another species (for now it's pseudo)
to load another species. The other species is loaded with the same range
around that feature. Links between the two species are automatically
derived from the blast link features that are present in both datasets
(no explicit link file needs to be specified).

- Database chooser was added to select the different species databases.

- Able to switch back and forth from synteny data adapter to regular data
adapters without restarting Apollo.

- Can save and edit (edit could use some rigorous testing)

- Can home in on a link from the link popup menu. Apollo zooms in and shows
the strands of the homed-in link; strands not in the link are hidden.

- Species now zoom and scroll together by default. Can unlock zoom with
shift key, and unlock scroll with menu item.

- You can now configure links between 2 curation sets that contain links to
each other. link_type, source and hit species are specified in the linked
type in the tiers file. This works with GAME and, in theory, could be made
to work with other adapters that have linked data embedded in the species
data.



Textpresso: A progress report

Eimear Kenny, Hans-Michael Mueller and Paul Sternberg

Updates made to Textpresso since September 2003:

Textpresso for Yeast Literature
(Toward a generic MOD information retrieval/extraction search engine)

SGD developers and curators met with Eimear Kenny for two weeks at
the beginning of March at Stanford to build a Textpresso search engine for
Yeast. During that period the Textpresso software was installed on a
Solaris system and three builds with a test corpus of ~400 full text
journal articles were completed. In addition, the Textpresso Ontology for
worm literature was modified to a functional preliminary ontology for
yeast literature. Plans to expand the corpus to 10,000 yeast papers and
make improvements to the yeast ontology are underway at Stanford.

Integration of Textpresso into Literature Curation Pipeline

We have integrated Textpresso into the WormBase curation pipeline
to expedite the extraction of genetic interaction information from the
literature. A prototype curation interface has been developed to enable a
curator to extract data from sentences returned by a Textpresso query for
genetic interaction. We find that these Textpresso sentences are enriched
3-fold for gene-gene interactions compared to sentences that mention two
or more gene names and 39-fold compared to random sentences from the
literature.

Textpresso MOD interface

We have generated a WormBase-like interface for Textpresso to integrate
the Textpresso information retrieval engine into the WormBase web site.
http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract

Textpresso Package

Hans-Michael Mueller is working on packaging Textpresso for release in
the first half of this year.

Textpresso paper ... under review

A Textpresso publication is currently under revision.



PubFetch/PubTrack Progress Report (April 2004)

PubFetch
PubFetch is a tool for accessing literature from various online resources.
The goal is to provide a common interface and common format to downstream
applications to allow them to query different literature repositories in
a single, unified fashion.

PubFetch has been implemented in two forms:
* Java servlet core + simple web interface to provide interactive access
to PubFetch
    * Provides access to PubMed and Agricola databases
* BioMOBY wrapper around servlet core to provide webservice access to PubFetch

A variety of new features have been introduced:
* Duplicate filtering - running the same search on multiple data sources
results in some duplication of articles. The duplicate filter detects
these articles and returns a non-redundant set of data. Database IDs from
both sources are maintained in the non-redundant set.
* The web interface version highlights keywords in the search results to
aid in review of the returned articles.
* Connection to full text - a hyperlink to the full text is returned (if
available from PubMed)
* Filtering of 'ahead of print' articles - Abstracts are appearing in
PubMed and being assigned PubMed Ids prior to being published and are
being reassigned PubMed Ids after publication. PubFetch allows filtering
of these ahead of print articles to retrieve only published articles.
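
The duplicate filtering described above can be pictured with a short
Python sketch. The report does not say how PubFetch recognizes duplicates;
this sketch assumes, purely for illustration, that two records match when
their normalized title and publication year agree. The record layout and
the example IDs are made up.

```python
import re


def normalize_title(title):
    """Lower-case the title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def merge_results(*result_sets):
    """Merge hits from several sources into a non-redundant list.

    Each merged entry keeps the database ID from every source it came from.
    """
    merged = {}
    for records in result_sets:
        for rec in records:
            key = (normalize_title(rec["title"]), rec["year"])
            entry = merged.setdefault(
                key, {"title": rec["title"], "year": rec["year"], "ids": {}})
            entry["ids"][rec["source"]] = rec["id"]
    return list(merged.values())


# Made-up search results from two sources.
pubmed_hits = [
    {"source": "PubMed", "id": "example-pmid-1",
     "title": "A Made-Up Article Title.", "year": 2003},
]
agricola_hits = [
    {"source": "Agricola", "id": "example-agid-7",
     "title": "A made-up article title", "year": 2003},
    {"source": "Agricola", "id": "example-agid-8",
     "title": "Another invented title", "year": 2002},
]
unique = merge_results(pubmed_hits, agricola_hits)
```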

The BioMOBY interface provides the following services:
* SearchPubmed - Search PubMed for given query and get PMIDs
* GetPubmed - Retrieve PubMed articles in MEDLINE display format for given
PMIDs
* FetchFull - Get FullText for given PMID
* fetchAgID - Search Agricola for given query and get Agricola accession number
* fetchAgDoc - Get Agricola document in MEDLINE-like format for given
Agricola accession number

Current work
The integration of PubFetch and PubSearch is in progress; our goal is to
have PubSearch using the PubFetch core module for literature retrieval by
summer of 2004. We will be adapting the Rat Genome Database literature
pipeline to use the PubFetch BioMOBY services to act as its source for
literature data download.

The current version of PubFetch is available from the GMOD cvs:
http://cvs.sourceforge.net/viewcvs.py/gmod/pubfetch/

Implementation of PubSearch at RGD
Following a curator review of existing PubSearch functionality, a variety
of new features were requested by the RGD curators to enable a more
'article-centric' view of the PubSearch database. This has been
implemented by the TAIR group and plans are underway to install this
latest version of PubSearch at RGD, populate with RGD/Rat data and test
in the RGD curation process.

PubTrack
PubTrack is a monitoring tool that tracks objects as they move through
a process or workflow. Existing workflow tools move data through a
specified process, passing datasets to applications and retrieving
results and passing them to the next step in the flow. PubTrack does
not aim to direct or control workflow, and it does not track the dataset
as a whole. Instead, it works at a higher resolution and tracks the data
objects within the dataset, enabling users to follow a particular object
as it moves through a process.

Progress to date:
* Review of existing workflow tools and schemas has been completed.
* The initial PubTrack schema has been developed and implemented in PostgreSQL
* Initialization scripts have been written to populate the PubTrack
database with initial object and process data. Perl scripts are used to
parse and load initialization data in a standard XML format; a DTD is
available and is used to confirm the data formatting.
* An API is under development to allow 3rd party applications to
communicate with PubTrack to initialize and update the tracking
information for objects under observation. This is being developed and
tested using data from a proteomics MS/MS analysis pipeline that is
being built in my lab.
* A basic web user interface is in development to provide end-users with
the ability to view objects and their progress through their designated
processes.
* The concept of 'estimated time of completion' has been added to allow
long term planning and project tracking. For example, the entire process
of curating an article might typically take 3 days, so the estimated time
of completion would be 3 days after the start of curation. This estimate
can be displayed on a Gantt chart and updated as individual steps in the
process are completed, allowing an increasingly refined view of the
completion date. This is being used in our proteomics tracking: component
1 generates tissue samples from animals in a process that takes up to 3
weeks to complete. Tracking the progress and updating the completion
time estimate with PubTrack allows lab members in component 2 to plan
ahead. They can see which samples will be ready and on what date,
and this is updated as the process progresses.
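
The way such an estimate tightens as steps finish can be sketched in
Python. This is not the PubTrack API; the step names and the one-day
durations are hypothetical, chosen only so that the whole process adds up
to the 3-day curation example given above.

```python
from datetime import datetime, timedelta

# Hypothetical workflow: (step name, typical duration).
CURATION_STEPS = [
    ("download", timedelta(days=1)),
    ("screen", timedelta(days=1)),
    ("curate", timedelta(days=1)),
]


def estimated_completion(steps, completed, as_of):
    """Estimate the finish time from the steps that are still outstanding.

    `completed` is the set of finished step names and `as_of` is the time
    of the most recent event (the start, or the last step completion).
    """
    remaining = sum((duration for name, duration in steps
                     if name not in completed), timedelta())
    return as_of + remaining


start = datetime(2004, 4, 1, 9, 0)
print(estimated_completion(CURATION_STEPS, set(), start))

# The download finished early, so the estimate moves forward.
print(estimated_completion(CURATION_STEPS, {"download"},
                           datetime(2004, 4, 1, 15, 0)))
```

Recomputing the estimate at each step completion gives the increasingly
refined completion date that a Gantt chart display would show.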

Current Work
When the API is stabilized we will deploy PubTrack in the existing RGD
literature curation pipeline and ultimately in combination with PubSearch
at RGD. This will create an entire system allowing tracking of literature
across a heterogeneous system as it is downloaded from PubMed, into
PubSearch, screened, moved to RGD's Oracle db, curated and ultimately
filed. A more comprehensive user interface will be developed based on the
experiences from the proteomics pipeline and the RGD curation pipeline.
The goal is to provide generic tracking views and a way to allow specific
users to customize the displays, charts and reports if needed.

PubTrack documents including schema, loading scripts, etc. can be found on
the GMOD CVS.
http://cvs.sourceforge.net/viewcvs.py/gmod/pubtrack/



PubSearch update

We've migrated our database schema over to one that should be more
compatible with a Chado schema: all of our table names now carry a
'pub_' prefix, and we've done some column renaming so that we use
consistent names throughout the system.

Our production server has also been upgraded from MySQL3 to MySQL4, and
we've rewritten some parts of PubSearch to take advantage of the
transaction support that the new MySQL provides.  We've also added
referential integrity constraints to the foreign keys in our tables.

We've adopted another tool called JCoverage to help us identify areas of
our code that are not being touched by our unit tests, and have started to
tighten up our test cases so that our major classes are being exercised.

We've worked toward removing dependencies on external resources.  Hit
generation now works directly from the Java codebase, rather than from an
external Python script.  We've continued work on a keyword term browser to
replace the highly munged version of AmiGO that we are running locally.



GBrowse Project

Coordinator: Lincoln Stein
Major Developers: Scott Cain
                  Aaron Mackey
                  Toshiaki Katayama
                  Vsevolod Ilyushchenko
                  Marc Logghe
                  Sheldon McKay
                  Mark Wilkinson

DESCRIPTION:

GBrowse is a web-based browser for genome annotations.  It is intended to
complement Apollo by providing a search, browse and drill-down display for
sequence-based features without the need for prior software installation.
GBrowse uses a database adaptor system to connect to a single primary data
source, and a temporary flat-file system to layer an arbitrary number of
third-party annotations on top of the primary data.  A plugin system is used
to add new functionality to gbrowse, such as more advanced searches, and
dynamically-computed features such as ab initio gene predictions.  An
internationalization layer allows GBrowse to display button labels, menus and
help text in a variety of common world languages.

The following gbrowse database adaptors currently exist:

      Bio::DB::GFF (oracle, postgresql & mysql)
       Well-tested and in production.

      Bio::DB::Das::Chado (postgresql)
       Well-tested and in early production.

      GenBank proxy
       Well-tested and in production.  Does not handle
       full-genbank keyword searches properly.

      Bio::DB::Das::BioSQL
       Adaptor for the BioSQL schema.  In beta test.

      Bio::Das
       Adaptor for DAS sources. Released, but probably best
       considered in beta test.

GBrowse has been downloaded from SourceForge 1,830 times, but this is
a poor way to count the number of GBrowse users.  A more conservative
estimate of users comes from tallying bug reports, which ensures that
the user has at least tried to install the software.  However, it
represents an undercount.  In any case, we can confirm that at least
100 laboratories have installed GBrowse.  As the list attached to the
bottom of this report shows, GBrowse can be found in academic,
governmental and commercial organizations in North America, South
America, Europe, Asia, Africa and Australia.

RECENT PROGRESS:

Since the last status report, we have added the following features to
GBrowse:

1) SVG output

Users can now click on a link labeled "Publication Quality Image" and
download a Scalable Vector Graphics version of the current view.  SVG
is an editable format that can be manipulated with popular graphics
programs such as Adobe Illustrator, and can be reprinted by journals
without the pixelation that plagues bitmapped images.

2) Security

Tracks can now be protected by username & password, restricted to
certain hosts, or limited to hosts presenting certain classes of RSA
(digital) certificates.  A restricted track does not appear on the
screen of unauthorized users, allowing system administrators to
present a mix of proprietary and public data.

3) DAS support

GBrowse can now run on top of distributed annotation system sources.
DAS is supported in three ways:
    a) As an external annotation source
       Users can layer remote DAS tracks on top of the current view.
       The remote DAS tracks will remain active from session to
       session.  The GBrowse administrator can preconfigure a set
       of "recommended" DAS sources, which will then appear in a
       user-selectable menu.

    b) As a primary database
       GBrowse can now be configured to use a local or remote DAS
       database as its primary data source.  This means that one
       can point GBrowse at the UCSC or ENSEMBL databases and
       immediately begin browsing them using the GBrowse user
       interface.

    c) As a DAS source
       GBrowse will act as a DAS server.  At the administrator's
       discretion, all or selected tracks can be made exportable
       via DAS, allowing sequence features to be shared between
       GBrowse instances or between GBrowse and other DAS clients.

4) Feature filtering and highlighting

A new filtering and highlighting API allows plugins to hide features
based on a set of user-supplied criteria or to highlight them in
various colors.

5) New adaptors

In addition to the DAS adaptor, we have added an experimental BioSQL
adaptor to GBrowse.  BioSQL is a flexible database schema designed by
the BioPerl & BioJava projects for the purposes of holding
GenBank/EMBL records in a relational format.

6) Support for GFF3 loading & dumping

GBrowse now can load and dump sequence annotations in GFF3 format
(http://song.sourceforge.net), a preliminary specification that
improves on the current GFF sequence feature format.  The advantage of
this format is that it uses the Sequence Ontology, a controlled
vocabulary of sequence feature types.
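
To give a feel for the format, here is a minimal Python sketch that reads
one GFF3 feature line. It is not GBrowse's loader (GBrowse is written in
Perl); it assumes the column layout of the published GFF3 specification:
nine tab-separated columns, with the feature type in column 3 and
semicolon-separated tag=value attributes in column 9.

```python
from urllib.parse import unquote

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    feature = dict(zip(COLUMNS, cols))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # "." marks an undefined score.
    feature["score"] = None if feature["score"] == "." else float(feature["score"])
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # Attribute values may be comma-separated lists and URL-escaped.
            attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


line = "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n"
print(parse_gff3_line(line))
```

Here the type column holds "gene", a Sequence Ontology term, which is what
lets different tools agree on what kind of feature a line describes.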

7) Integrated MOBY support

The BioMOBY system (www.biomoby.org) is a web services system that
allows users to quickly locate and invoke bioinformatics services.
GBrowse now has an interface which allows it to find services that
will operate on selected sequence features.  For example, GBrowse can
present users with a list of current services that will operate on
Drosophila gene names.

8) Support for writeback

A writeback layer has been added to GBrowse to allow external editors
to update the underlying database.  This has been tested successfully
with the Artemis editor in the context of a USDA pathogens database
project.  Testing with Apollo is still underway.  Currently it is
recommended to edit sequence databases via the shared Chado schema and
the Apollo->Chado->GBrowse route, rather than to use Apollo->GBrowse
directly.

9) New glyphs

We have recently added a number of new glyphs for use with the
International HapMap Project.  New glyphs include a "weighted allele"
glyph that indicates the major and minor alleles of a single
nucleotide polymorphism, and a set of glyphs for visualizing haplotype
blocks.

10) Bug fixes

Performance has been improved when uploading large third-party annotation
files.  Nucleotide-level alignments have been fixed when the display
is "flipped."  The feature name search methods have been cleaned up to
provide more consistent behavior.

PLANS FOR THE FUTURE:

Performance is a concern when viewing large numbers of uploaded
third-party features. We plan to fix this by implementing an indexed
flat file cache for uploaded features.

The user interface needs to be improved in some respects.  One useful
idea is to place an icon to the left of each track to indicate whether
it is in an expanded or collapsed state.

The ability to use a different DAS source for each track, which is a
feature of ISB GBrowse, will be ported over.

As always, we are looking for volunteers fluent in non-English
languages to create and update the internationalization files.

Contact: Lincoln Stein <lstein@cshl.org>

APPENDIX. Confirmed users of GBrowse:

        Agricultural Biotechnology Center, Hungary
        BAWI, S. Korea
        Baylor College of Medicine
        Biocrates GmbH, Innsbruck
        Brandeis University
        Bristol-Myers Squibb
        British Columbia Centre for Disease Control
        CIRAD, France
        CSIRO, Australia
        Cambridge University (multiple labs)
        Center for Genomics & Bioinformatics, Stockholm
        Centre de Genetique Moleculaire, CNRS
        Cold Spring Harbor Laboratory (multiple labs)
        Compugen
        Concordia University, Canada
        Cornell Medical School
        Cornell University
        DNA Landmarks, Inc.
        Donald Danforth Plant Sciences Center
        Duke University (multiple labs)
        EMBL, Heidelberg
        EuGenes (hacked copy)
        Faculdade de Medicina de Ribeirão Preto, São Paulo
        FlyBase
        Foundation for Research and Technology, Crete
        Fundação Hemocentro, São Paulo
        Genoscope, France
        GrainGenes
        Harvard University
        Hospital for Sick Kids, Toronto
        Illinois Institute of Technology
        Incyte Corporation
        Inpharmatica, Ltd.
        Institute for Systems Biology, Seattle
        Institute of Molecular and Cell Biology, Singapore
        International Rice Research Institute, Philippines
        John Innes Centre
        KEGG
        Kansas State University
        Karolinska Institute
        Kennedy Krieger Institute
        Lawrence Berkeley Laboratories
        Marine Biological Laboratories, Woods Hole
        Massachusetts Institute of Technology (multiple labs)
        Mayo Institute
        McGill University
        Meat Animal Research Center, University of Nebraska
        Medical University of South Carolina
        Michigan State University
        NHGRI, NIH
        National Cancer Institute, Frederick Cancer Center
        New York University (multiple labs)
        North Carolina State University
        Northern Illinois University
        Northwestern University
        Oklahoma State University
        Open Informatics Consulting Corp.
        Oxagen Corp.
        Pasteur Institute, Paris
        Pioneer Corporation
        QIAGEN Operon Corp.
        RIKEN (multiple labs)
        RatDB
        Regulome, Inc.
        Rhobio (Bayer CropScience SA & Biogemma joint venture)
        Rigshospitalet, Copenhagen
        Rockefeller University
        Roslin Institute, Edinburgh
        Russian Academy Medical Sciences
        Serono International Corp, Geneva
        Simon Fraser University
        South Africa National Bioinformatics Institute
        Southern Illinois University
        St. Jude Children's Research Hospital, Memphis
        Stowers Institute for Medical Research
        Texas A&M (multiple labs)
        The Institute for Genomic Research
        Tulane University
        University of California Davis
        University of Arizona (multiple labs)
        University of British Columbia
        University of California Santa Barbara
        University of Georgia (multiple labs)
        University of Minnesota
        University of Muenster
        University of New South Wales, Australia
        University of Oklahoma (multiple labs)
        University of Pennsylvania (multiple labs)
        University of Southern California
        University of Texas
        University of Toronto
        University of Virginia
        University of Washington
        Universität Giessen
        Université de Liège, Belgium
        Wageningen Universiteit & Researchcentrum, Netherlands
        Washington University at St. Louis (multiple labs)
        WormBase
        deVGen, Belgium


CMAP
Main developer:     Ken Clark

Recent improvements include:

*   Now CGI-based (no more mod_perl dependencies), making installation
    much easier (and much more like GBrowse)
*   Added SVG output
*   Added multiple aliases for features
*   Added support for arbitrary attributes for db objects
*   New cross-reference scheme allows for unlimited xrefs on most db objects
*   Experimental XML export/import of data added
*   User tutorial added
*   Faster, fewer bugs, etc.

CMAP is known to be in use by:

Barry Marler (Andy Paterson), Alex Feltus, Pratt: UGA
Rex Nelson, Chet Langin, Xiaokang Pan: Iowa State
Michelle Bobo: Oregon Health & Science University
Victor Ulat, Richard Bruskiewich: IRRI
Matthew Hobbs: University of Sydney (Australia)



                          Pathway Tools Status Report
                                  Peter Karp
                                February 5, 2004

Please note that the full history of updates to Pathway Tools can be
found at URL
http://bioinformatics.ai.sri.com/ptools/release-notes.html

Significant updates funded under this grant since the last report are
as follows.

o We have implemented the proposed Napster-like peer-to-peer sharing
of Pathway/Genome Databases via a central network registry server.
Pathway Tools users will be able to use the software to register new
PGDBs that they create to this central registry server at SRI, and
they will be able to use the software to browse the registry and
to retrieve and install PGDBs listed there for local analysis.

o Pathway Tools has been extended to support annotation of protein
domains, sites, and chemical modifications.  We have created an
ontology of domain, site, and modification types.  The Pathway/Genome
Editor tools have been extended to allow users to interactively
annotate these features on protein sequences, and the Pathway/Genome
Navigator has been extended to display these annotated features.

o We have added a batch-processing mode to the portion of Pathway Tools
that creates new Pathway/Genome Databases to allow large-scale automated
processing of multiple genomes without manual intervention.  We have
undertaken a collaboration with the European Bioinformatics Institute,
who are interested in applying Pathway Tools to generate Pathway/Genome
Databases for a large number of genomes.

o We have integrated an algorithm for pathway hole filling into
Pathway Tools.  A pathway hole is a reaction step in a metabolic
pathway for which no enzyme has been identified in the genome of
an organism.  The pathway hole filler uses a combination of techniques
to predict which genes in the genome code for these missing enzymes.
[This algorithm was developed under separate funding.]
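
The definition of a pathway hole given above can be made concrete with a
small Python sketch. It only identifies the holes; it does not attempt
the prediction step performed by the pathway hole filler, and it is not
Pathway Tools code. The reaction and gene names are made up.

```python
def find_pathway_holes(pathway_reactions, enzyme_assignments):
    """Return the reactions in a pathway that have no enzyme assigned.

    `enzyme_assignments` maps a reaction name to the list of genes in the
    genome annotated as coding for an enzyme that catalyzes it.
    """
    return [rxn for rxn in pathway_reactions
            if not enzyme_assignments.get(rxn)]


# Made-up pathway of three reaction steps.
pathway = ["rxn-1", "rxn-2", "rxn-3"]
assignments = {"rxn-1": ["geneA"], "rxn-3": []}
print(find_pathway_holes(pathway, assignments))
```

A reaction counts as a hole whether it is missing from the assignments
entirely or is present with an empty gene list.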

o We have completely re-designed the menus of the desktop version
of Pathway/Genome Navigator to be more consistent with other
graphical interfaces, more intuitive to the user, and to provide
more screen area for the display of visualizations.

o We have integrated an SBML (Systems Biology Markup Language) output
tool written in the Church lab at Harvard into Pathway Tools, allowing
the reaction network within a Pathway/Genome Database to be exported
to SBML format, from which it can be imported into a number of
simulation and analysis software packages.

o We have reworked the display of information about protein complexes
within Pathway Tools to increase the clarity of this information.

o The preceding capabilities will be present in the February release
of Pathway Tools.

o We have received many emails from users reporting bugs and asking for
information.

o 80 groups have licensed Pathway Tools to date.

o Pathway/Genome Databases available through the web include:

   o Saccharomyces cerevisiae, Stanford University
     http://pathway.yeastgenome.org/biocyc/

   o Plasmodium falciparum, Stanford University
     plasmocyc.stanford.edu

   o Mycobacterium tuberculosis, Stanford University
     BioCyc.org

   o Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
     Arabidopsis.org:1555

   o Methanococcus jannaschii, EBI
     Maine.ebi.ac.uk:1555   (availability intermittent)
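
As a minimal sketch of the pathway-hole definition given above (and not of
the Pathway Tools hole filler itself, whose prediction techniques are not
described in this report), the following Python fragment flags reaction
steps that have no enzyme assigned.  The Reaction class, the
find_pathway_holes function, and all identifiers in the example data are
hypothetical and are not part of Pathway Tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    """One reaction step in a metabolic pathway (hypothetical structure)."""
    reaction_id: str
    # Genes in this genome annotated as encoding an enzyme for the step.
    enzyme_genes: List[str] = field(default_factory=list)


def find_pathway_holes(pathway: List[Reaction]) -> List[str]:
    """Return the IDs of reactions for which no enzyme has been identified."""
    return [r.reaction_id for r in pathway if not r.enzyme_genes]


# Illustrative data only: reaction IDs and gene names are invented.
example_pathway = [
    Reaction("RXN-1", ["geneA"]),
    Reaction("RXN-2"),                      # no enzyme assigned: a hole
    Reaction("RXN-3", ["geneB", "geneC"]),
    Reaction("RXN-4"),                      # no enzyme assigned: a hole
]

holes = find_pathway_holes(example_pathway)
print(holes)
```

Finding the holes is only the first step; predicting which genes fill them
is the harder problem that the hole filler addresses.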


                          Pathway Tools Status Report
                                  Peter Karp
                                April 20, 2004

Please note that the full history of updates to Pathway Tools can be
found at URL
http://bioinformatics.ai.sri.com/ptools/release-notes.html

Significant updates funded under this grant since the last report in
February 2004 are as follows.

o Version 8.0 of Pathway Tools was released on March 12, 2004.
SRI continues to hold to our planned schedule of two releases of
Pathway Tools per year.

o 275 groups have licensed Pathway Tools to date.  The large jump
in this number since the last report reflects the fact that these
numbers also include groups who use Pathway Tools to query
existing Pathway/Genome Databases (not reported earlier), in addition
to groups who use it to create new databases.

o We have made very significant progress on development of an
algorithm to automatically lay out the one-page metabolic overview
diagram that shows the full metabolic network of an organism -- the
algorithm is now working.  We are also in the process of adding new
components of the cellular machinery to this diagram.

o SRI has hosted two 4-day training sessions for Pathway Tools.
The dates and 26 attendees are listed below.  Most attendees have
brought genomes with them to the training sessions, and have left
with draft Pathway/Genome Databases.

Tutorial on March 15-18, 2004

1. John Burke             Biotique Inc.
2. Guillaume Meurice      Pasteur Institute
3. David Simon            Pasteur Institute
4. Gregory P. Fournier    MIT
5. Alex Picone            Biatech
6. John Bashkin           SRI
7. Tit Yee Wong           University of Memphis
8. Ken Kaufman            UC Berkeley
9. Jeremy Glasner         University of Wisconsin
10. Lisa Herron-Olson     University of Minnesota
11. Devaki Bhaya          Carnegie Institution


Tutorial on April 19-22, 2004

1   Dr. Matthew Berriman    The Wellcome Trust Sanger Institute
    T. brucei & L. major
2   Herbert Chiang          Washington University
    Bacteroides thetaiotaomicron
3   Clinton Fernandez       University of British Columbia
    Rhodococcus sp. RHA1 (~10MB)
4   Lisa Koski              University of Montreal, Canada
5   Rebecca Krupp           UCLA
    Methanosarcina acetivorans
6   Joanne Luciano          BioPathways Consortium
    Prochlorococcus marinus MED4
7   Jasintha Maniraja       Universite Libre de Bruxelles
    Mus musculus
8   Linyong Mao             Pacific Northwest National Laboratory
    Shewanella oneidensis
9   Michael P. McLeod       University of British Columbia
    Rhodococcus sp. RHA1 (~10MB)
10  Dylan Morris            CalTech
    Mycoplasma genitalium
11  Gavin Murphy            CalTech
    Bdellovibrio
12  Joo-Heon Park           University of Texas-Houston Medical School
    Treponema pallidum
13  Liviu Popescu           Cornell University, Computer Science
    Saccharomyces cerevisiae
14  Christopher Reigstad    Washington University
    unpublished uropathogenic E. coli
15  Haluk Resat             Pacific Northwest National Laboratory
16  Jian Song               Los Alamos National Laboratory
    Pseudomonas aeruginosa



GMOD Project Status     April 2004        D. Gilbert (gilbertd@indiana.edu)

Project members:  Don Gilbert, Josh Goodman, Paul Poole,
Vasanth Singan (student), at Indiana University.

Projects in development for GMOD:

(1) LuceGene, document/object search/retrieval for genome data
www.gmod.org/lucegene/   eugenes.org:8081/gmod/lucegene/
version 1.2 (alpha), released for public use April 2004.
In use at FlyBase.net, euGenes.org, wFleaBase. LuceGene is similar in
concept to the bioinformatic databank access tool SRS, and to web search
systems such as Google. Based on Lucene, this Java program is fast and
flexible at search and retrieval of complex data objects.  It
outperforms the Chado Postgres database by 10x or more at gene object
retrieval.

(2) Genome Directory System, data mining access to genome data
www.gmod.org/gds/
In development: web services for SOAP access to genome data and
biosequence databanks.  We plan to provide production data mining
services through this system, covering FlyBase and euGenes genomes and
the Bio-Mirror/IUBio biosequence databanks.  It will be added to the
ARGOS package for genome databases.  Plans include testing FlyBase
data analyses over TeraGrid in Fall 2004.

(3) ARGOS, a replicable genome information system
www.gmod.org/argos/  flybase.net/argos/  eugenes.org/argos/
Version 0.7 (alpha, March 2004).
ARGOS is used now for replicating public web-genome databases. It contains
all of FlyBase, euGenes, wFleaBase, and some other services.
Contents include 10 GB of multi-genome data (euGenes), 8 GB of Drosophila
data (FlyBase), and 500 MB of common software (servers, binaries).
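
LuceGene (item 1) is built on Lucene, a library that answers queries from
an inverted index mapping each term to the documents that contain it.  The
Python sketch below illustrates that general idea only.  It is not LuceGene
or Lucene code, the function names are hypothetical, and the gene records
are invented for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Use .get so that unknown terms do not create new index entries.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Invented records standing in for gene objects.
records = {
    "gene1": "white eye color pigment transporter",
    "gene2": "yellow body color cuticle",
    "gene3": "eye development transcription factor",
}
index = build_index(records)
print(search(index, "eye color"))
```

Because each query term is resolved by a dictionary lookup rather than a
scan or a multi-table join, retrieval cost depends mainly on the size of
the matching posting sets.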

Miscellany:
gmod/schema/XMLTools/ChadoSax/  reader for chado.xml; provides
  FlyBase annotation data access.
gmod/schema/GMODTools/  Perl modules using the GMOD 0.001 release for
   managing miscellaneous sequences (EST, GSS, etc.) in a Chado database.
   Used now in the Daphnia / wFleaBase genome database (eugenes.org/daphnia).
Apollo data search/retrieval system used at
   flybase.net/apollo/
   a web CGI using Chado Postgres + LuceGene
   for retrieval of Game XML annotations by
   lookup of gene name, genome location, or other attributes.
Tested, aided development of, and used GMOD release 0.001, Postgres Chado,
XORT, Chado::DBI, GBrowse, and other tools for FlyBase and wFleaBase, where
they now form the basis of data management.



GMOD Update from the Saccharomyces Genome Database (SGD)

    Before the last GMOD meeting at Berkeley, SGD released several GMOD
software packages (Blast Graphic Viewer, Restriction Graphic Viewer and
GO Graphic Viewer). Since then, we have been working on incorporating
existing GMOD products into new tools and resources at SGD. Here is a
list of projects that are currently under development or already in
production.

1. New Fungal BLAST using BLAST Graphic Viewer.
    SGD has created a new Fungal BLAST interface using the BLAST Graphic
Viewer. This new tool can be used to do BLASTN or TBLASTN searches using
any sequence of choice against any combination of fungal sequence datasets,
including genome sequences of fungal model organisms and pathogens, ESTs,
and other fungal sequence sets in GenBank. The fungal BLAST search at SGD
can be accessed from this URL.

    http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast-fungal.pl


2. GBrowse at SGD
    GBrowse has been set up at SGD. SGD is still testing the software
before making a general announcement about the availability of the
software.  This software is running on top of a MySQL database whose
tables are populated from a flat file in GFF3 format (see item 3 for
details). GBrowse at SGD can be accessed from this URL.

    http://www.yeastgenome.org/cgi-bin/SGD/gbrowse/gbrowse/yeast

3. GFF3 file format
    SGD has started to provide the sequence features of the S. cerevisiae
genome in a flat file that is fully compatible with the GFF3 format.
This file is used as the data input to load the MySQL database for
GBrowse and the PostgreSQL database running the Chado schema for SGD Lite
at Princeton. This file is updated every week on SGD's ftp site. This
file is available for download from this URL.

    ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGDGFF3.gff
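
For readers unfamiliar with GFF3, each feature line has nine tab-separated
columns (seqid, source, type, start, end, score, strand, phase, attributes),
with attributes written as URL-escaped key=value pairs separated by
semicolons.  A minimal Python sketch of reading one such line follows.  The
function name and the example line are hypothetical and are not taken from
the SGD file.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # Split multi-valued attributes on commas before unescaping.
                attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Invented example line; not copied from any real annotation file.
example = "chrI\texample\tgene\t335\t649\t.\t+\t.\tID=geneX;Name=geneX;Note=example%20feature"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```

A loader for GBrowse or Chado would do considerably more validation; this
only shows the column layout.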


4. SGD Lite and CHADO
    The SGD colony at Princeton has been working on installing GMOD
release 0.002.  The Chado schema versions from both releases (0.001
and 0.002) have been successfully installed and loaded (via a
modified GFF3 file) on a desktop running Mac OS X 10.3.2 using the
included installation scripts.  We are currently working on installing
0.002, including GBrowse, on an Apple X server running 10.3.2.  We plan
to assemble installation notes/documentation and distribute them during
the meeting.

5. Textpresso Beta testing
    SGD has a wealth of literature information. We want to provide
expanded text searching to our users, since we have an abstract and/or
full text for most of our references. Textpresso is an information
retrieval system developed by WormBase at Caltech. Eimear Kenny spent
two weeks at SGD to help set up a test version of Textpresso. The SGD
Textpresso can be accessed from this URL.

    http://www.yeastgenome.org/textpresso/

Currently, we are working on improving Textpresso's software
performance, as well as developing a yeast version of the Textpresso
ontology. We improved the performance of the markup script (text2xml.pl)
by 50%. We are also considering a few options to improve the indexing
mechanism. With regard to the ontology, we have modified the 'Gene'
and 'Localization in Time and Space' categories.  We are also currently
working on a few other categories, such as Allele, Transgene and
Phenotype, in order to best reflect the biology in S. cerevisiae.
