Generic Model Organism Database Construction Set
GMOD Meeting April, 2004
Mount Vernon Room
9:00am | Introductions Scott Cain (CSHL) |
9:20 | Don Gilbert (FlyBase, Indiana University) |
10:30 | Break |
10:45 | Frank Smutniak (FlyBase, Harvard University) |
11:20 | Stan Letovsky (FlyBase, Harvard University) |
11:45 | Lunch (on your own; many good restaurants, check with a local) |
Mount Vernon Room
1:30 | Arek Kasprzyk (EBI) |
2:00 | GMOD/Turnkey web demo Brian O'Connor (UCLA) |
2:30 | Break |
3:00 | Eimear Kenny (WormBase, CalTech) |
3:30 | Toshiaki Katayama (Human Genome Center, University of Tokyo, Japan) |
3:45 | Break |
4:00 | David Emmert (FlyBase, Harvard University) |
4:30 | Scott Cain |
Terrace Room
1:30 Jennifer Wortman (The Institute for Genomic Research)
1:50 Shannon Schlueter (Arabidopsis thaliana Plant Genome Database, Iowa
State University)
2:10 Aniko Sabo (Genome Sequencing Center, Washington University School
of Medicine)
2:30 Break
2:50 Madeline Crosby (FlyBase, Harvard University)
3:10 Kim Worley (Human Genome Sequencing Center, Baylor College of
Medicine)
3:30 Astrid Terry (Joint Genome Institute)
3:50 Break
4:10 Chinnappa Kodira (Broad Institute)
4:30 Michele Clamp (Broad Institute)
4:50 Break
5:00 Group discussion
6:00 Dinner (on your own; see above)
Mount Vernon Room
9:00 | GMOD Alpha release, Part II |
The goal is to install the gmod alpha release on attendees' computers in order to test the installation procedure and work through issues with the release. There will almost certainly also be time for “breakout” sessions in which smaller groups discuss a variety of topics. Suggestions will be accepted both before and during the meeting.
Terrace Room
9:00 | Apollo Demo Sima Misra (FlyBase & Berkeley Drosophila Genome Center) |
9:40 | Nomi Harris (FlyBase & Berkeley Drosophila Genome Center) |
10:00 | Hands-on Apollo workshop for curators Breakout session for Apollo developers |
12:00 | Lunch |
1:30 | Q&A session with curators & Apollo developers |
2:30 | Hands-on Apollo workshop for curators Breakout session for Apollo developers |
4:30 | Q&A session with curators & Apollo developers Group discussion |
6:00 | Reservation on us, food paid for by you |
Mount Vernon Room
9:00 | Updates from the previous day Scott Cain and Sima Misra |
10:00 | Break |
10:15 | Bill Gelbart (FlyBase, Harvard University) |
11:15 | Break |
11:30 | Closing remarks, planning for the next meeting Scott Cain |
GMOD Project Progress Reports
April, 2004
-----------------------------
The past four months have seen the first two releases of gmod, which will
become the suite of model organism database software. The first release,
version 0.001 (alpha), was released in January, 2004. The main goal of that
release was to establish a release procedure. The release consisted of a
database schema, referred to as chado, which was developed primarily by
FlyBase developers at Harvard and BDGP. Additionally,
there were a variety of tools for installing and loading data into the
database which were developed primarily by Allen Day at UCLA and Scott
Cain at CSHL. Finally, there was a compatible version of the Generic
Genome Browser with a chado database adaptor developed to allow browsing
of genome features directly from the database.
The second release, also an alpha release, consisted of the same components,
and was released in March, 2004. In this release, the installation procedure
improved considerably, and a prerequisite that had caused testers difficulties
was removed. During the GMOD meeting in April, this release was installed
by several attendees during a workshop. Several suggestions were made that
will be implemented in the next release.
There are several items planned for addition or improvement in the next
two releases. Tools to allow importing and exporting XML formatted data
from chado will be included, which will allow the sequence annotation tool,
Apollo, to be used with chado. Additionally, a template-based web front end for
chado, called turnkey, will be included in an upcoming release. This software is
still early in the development process, but when it was presented to
developers at the GMOD meeting in April, there was considerable interest
in getting it included in a gmod release as soon as possible.
Longer-term goals for gmod releases include adding pubsearch and pubfetch.
The process of porting these applications has begun and is expected to be
complete by the end of the year. A tool for literature-based sequence
annotation, called JavaSEAN, is expected to be included in gmod in a similar
time frame. Additionally, there are plans from the Apollo developers to
create a new version of Apollo that will be able to read and write directly
to the database without using an XML intermediary, which will simplify the
process of sequence annotation considerably.
Apollo Progress Report (11/2003 - 4/2004)
Major improvements in release 1.3.6 (11/3/03):
Apollo now runs under JDK1.4, which works better on most platforms.
Can rubberband a region on the axis and the selected sequence will pop up
in a Sequence window.
Results that represent hits against sequences that are new to their
respective database (as indicated in tiers file) are shown with a box
around them, so that the curator can immediately see which results are
new and need to be looked at.
Search (Find) now allows full regexps.
Instead of having the config files in $HOME/.apollo be slightly modified
copies of the ones in APOLLO_ROOT/conf, you can now put ONLY the stuff
you want changed into your personal cfg files. Apollo will first read
the ones in APOLLO_ROOT/conf, and then read your personal cfgs and apply
any modifications.
Synteny (see Synteny section at end)
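As a rough illustration of the layered configuration described for release
1.3.6 (defaults from APOLLO_ROOT/conf, then personal changes from
$HOME/.apollo), the sketch below reads a set of default settings and then
applies only the entries present in a personal file. It is written in Python
and is not Apollo's own Java code; the `Key "value"` line syntax, the example
setting names, and the functions parse_cfg and layered_config are assumptions
made for the example.

```python
from typing import Dict


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse hypothetical 'Key "value"' lines; blank lines and // comments are skipped."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip().strip('"')
    return settings


def layered_config(default_text: str, personal_text: str) -> Dict[str, str]:
    """Read the defaults first, then let the personal file override single keys."""
    merged = parse_cfg(default_text)
    merged.update(parse_cfg(personal_text))
    return merged


# Hypothetical setting names, used only to show the override behaviour.
DEFAULTS = 'BackgroundColor "black"\nAutosaveInterval "20"\n'
PERSONAL = '// only the setting I want changed\nAutosaveInterval "5"\n'
MERGED = layered_config(DEFAULTS, PERSONAL)
```

With this arrangement the personal file stays short, and a new default added
to the shared file is picked up without editing each user's copy.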
Major improvements in release 1.4.0 (internal release) (2/9/2004):
New game.tiers file format (easier to read and change). If you have an
old game.tiers, it will be autoconverted to the new format.
Better handling of non-gene annotation types. New glyphs for showing
them in main Apollo display.
New annotations are automatically assigned the type (e.g. gene, tRNA,
etc.) appropriate to the evidence that was used to create them. (Type
can then be changed in the annotation info editor, if desired.)
Structured transaction records are now added to the XML when you save.
They include the type of object that changed (e.g. TRANSCRIPT;
ANNOTATION; COMMENT), the operation (e.g. ADD, SPLIT, etc.), the relevant
names and/or IDs before and after the transaction, and the user and
time/date when the change was made.
Support for translational exceptions, including frame shifts and one base
pair genomic sequencing errors.
UTRs are now shown in a different (configurable) color from the rest of
the gene.
Restriction enzyme mapper:
- Cut sites show up in main window (near the axis)
- Can now map multiple restriction enzymes at once
- Table of restriction fragments; can be selected for viewing in
Sequence window
Annotation info window:
- Now has integrated annotation tree
- Shows arbitrary properties for annotations and transcripts (including
validation_flag)
- Shows translational exceptions and genomic sequencing errors
- Lets you edit annotation ID as well as name/symbol
Ability to tag results by selecting from a list of comments, which are
specified (as ResultTags) in game.style. Tagged results are crosshatched
in pink in the display.
Fixed updating of peptide sequences.
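The restriction enzyme mapper listed above reports cut sites for several
enzymes at once and a table of the resulting fragments. A minimal sketch of
that idea follows, in Python rather than Apollo's Java. It assumes a linear
sequence, searches the top strand only (sufficient for the two palindromic
sites used here), and uses a small hypothetical enzyme table; the function
names cut_sites and fragments are illustrative and are not Apollo's API.

```python
from typing import Dict, Iterable, List, Tuple

# Recognition site and cut offset within the site (top strand).
# EcoRI cuts G^AATTC and BamHI cuts G^GATCC.
ENZYMES: Dict[str, Tuple[str, int]] = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
}


def cut_sites(seq: str, enzymes: Iterable[str]) -> List[Tuple[int, str]]:
    """Return sorted (cut position, enzyme name) pairs, 0-based."""
    seq = seq.upper()
    sites: List[Tuple[int, str]] = []
    for name in enzymes:
        site, offset = ENZYMES[name]
        start = seq.find(site)
        while start != -1:
            sites.append((start + offset, name))
            start = seq.find(site, start + 1)
    return sorted(sites)


def fragments(seq: str, enzymes: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (start, end, length) for each fragment of a linear sequence."""
    cuts = sorted({pos for pos, _ in cut_sites(seq, list(enzymes))})
    bounds = [0] + cuts + [len(seq)]
    return [(a, b, b - a) for a, b in zip(bounds, bounds[1:]) if b > a]
```

Fragment coordinates are half-open intervals, so a row of the table can be
used directly as a slice of the sequence.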
Improvements in releases 1.4.1 (3/12/04) and 1.4.2 (3/18/04):
Red/green markers at axis show where sequence/region ends.
To help you identify splice sites that are unconventional, colored
triangles appear in the annotation glyph.
Can now load D. melanogaster data from r3.1 (gadfly) and r3.2 (chado)
(both via cgi).
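The report does not say which rule Apollo uses to decide that a splice site
is unconventional. One common convention is that an intron begins with GT and
ends with AG on the transcript strand. The Python sketch below flags intron
boundaries that break that convention. It assumes forward-strand annotations
with half-open exon coordinates; the function name is illustrative only.

```python
from typing import List, Tuple


def unconventional_splice_sites(
    seq: str, exons: List[Tuple[int, int]]
) -> List[Tuple[int, str, str]]:
    """Flag intron ends that are not GT (donor) or AG (acceptor).

    exons: sorted (start, end) pairs, 0-based and half-open, forward strand.
    Returns (position, "donor" or "acceptor", dinucleotide found).
    """
    seq = seq.upper()
    flagged: List[Tuple[int, str, str]] = []
    for (_, intron_start), (intron_end, _) in zip(exons, exons[1:]):
        donor = seq[intron_start:intron_start + 2]
        acceptor = seq[intron_end - 2:intron_end]
        if donor != "GT":
            flagged.append((intron_start, "donor", donor))
        if acceptor != "AG":
            flagged.append((intron_end - 2, "acceptor", acceptor))
    return flagged
```

Each flagged position corresponds to a place where a display could draw a
marker on the annotation glyph.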
1.4.3 (4/19/04):
Users can now get the sequence of the entire segment being viewed, not
just a rubberbanded section. [File -> Save sequence]
Synteny progress, 11/03-4/04:
- Synteny now works with GAME. You can load one species and then use the
blast or syntenic block results to another species (for now it's pseudo)
to load another species. The other species is loaded with the same range
around that feature. Links between the two species are automatically
derived from the blast link features that are present in both datasets
(no explicit link file needs to be specified).
- Database chooser was added to select the different species databases.
- Able to switch back and forth from synteny data adapter to regular data
adapters without restarting Apollo.
- Can save and edit (edit could use some rigorous testing)
- Can home in on a link from the link popup menu. Apollo zooms and shows the
strands of the homed-in link; strands not in the link are hidden.
- Species now zoom and scroll together by default. Can unlock zoom with
shift key, and unlock scroll with menu item.
- You can now configure links between 2 curation sets that contain links to
each other. link_type, source and hit species are specified in the linked
type in the tiers file. This works with GAME and, in theory, could be made to
work with other adapters that have linked data embedded in the species
data.
Textpresso: A progress report
Eimear Kenny, Hans-Michael Mueller and Paul Sternberg
Updates made to Textpresso since September 2003:
Textpresso for Yeast Literature
(Toward a generic MOD information retrieval/extraction search engine)
SGD developers and curators met with Eimear Kenny for two weeks at
the beginning of March at Stanford to build a Textpresso search engine for
Yeast. During that period the Textpresso software was installed on a
Solaris system and three builds with a test corpus of ~400 full text
journal articles were completed. In addition, the Textpresso Ontology for
worm literature was modified to a functional preliminary ontology for
yeast literature. Plans to expand the corpus to 10,000 yeast papers and
make improvements to the yeast ontology are underway at Stanford.
Integration of Textpresso into Literature Curation Pipeline
We have integrated Textpresso into the WormBase curation pipeline
to expedite the extraction of genetic interaction information from the
literature. A prototype curation interface has been developed to enable a
curator to extract data from sentences returned by a Textpresso query for
genetic interaction. We find that these Textpresso sentences are enriched
3-fold for gene-gene interactions compared to sentences that mention two
or more gene names and 39-fold compared to random sentences from the
literature.
Textpresso MOD interface
We have generated a Wormbase-like interface for Textpresso to integrate
the Textpresso information retrieval engine in the Wormbase web-site.
http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract
Textpresso Package
Hans-Michael Mueller is working on packaging Textpresso for release in
the first half of this year.
Textpresso paper ... under review
A Textpresso publication is currently under revision.
PubFetch/PubTrack Progress Report (April 2004)
PubFetch
PubFetch is a tool for accessing literature from various online resources.
The goal is to provide a common interface and common format to downstream
applications to allow them to query different literature repositories in
a single, unified fashion.
PubFetch has been implemented in two forms:
* Java servlet core + simple web interface to provide interactive access
to PubFetch
* Provides access to PubMed and Agricola databases
* BioMOBY wrapper around servlet core to provide webservice access to PubFetch
A variety of new features have been introduced:
* Duplicate filtering - running the same search on multiple data sources
results in some duplication of articles; the duplicate filter detects
these articles and returns a non-redundant set of data. Database Ids from
both sources are maintained in the non-redundant set.
* The web interface version highlights keywords in the search results to
aid in review of the returned articles.
* Connection to full text - a hyperlink to the full text is returned (if
available from PubMed)
* Filtering of 'ahead of print' articles - Abstracts are appearing in
PubMed and being assigned PubMed Ids prior to being published and are
being reassigned PubMed Ids after publication. PubFetch allows filtering
of these ahead of print articles to retrieve only published articles.
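The duplicate filter and the ahead-of-print filter in the list above can be
sketched as follows. This is an illustration in Python, not the PubFetch
servlet code. The report does not state how duplicates are detected, so the
sketch assumes that two records are the same article when their normalized
title and year match; the record fields (source, id, title, year,
ahead_of_print) and the identifiers in the usage note are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple


def _article_key(record: Dict) -> Tuple[str, object]:
    """Assumed matching rule: lower-cased title without punctuation, plus year."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record.get("year"))


def published_only(records: Iterable[Dict]) -> List[Dict]:
    """Drop records flagged as ahead of print (hypothetical boolean field)."""
    return [rec for rec in records if not rec.get("ahead_of_print", False)]


def merge_duplicates(records: Iterable[Dict]) -> List[Dict]:
    """Collapse duplicates into one entry that keeps the id from every source."""
    merged: Dict[Tuple[str, object], Dict] = {}
    for rec in records:
        key = _article_key(rec)
        if key not in merged:
            merged[key] = {"title": rec["title"], "year": rec.get("year"), "ids": {}}
        merged[key]["ids"][rec["source"]] = rec["id"]
    return list(merged.values())
```

For example, a PubMed record and an Agricola record with the same title and
year come back as a single entry whose "ids" field holds both accession
numbers.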
The BioMOBY interface provides the following services:
* SearchPubmed - Search PubMed for given query and get PMIDs
* GetPubmed - Retrieve PubMed articles in MEDLINE display format for given
PMIDs
* FetchFull - Get FullText for given PMID
* fetchAgID - Search Agricola for given query and get Agricola accession number
* fetchAgDoc - Get Agricola document in MEDLINE like format for given
Agricola accession number
Current work
The integration of PubFetch and PubSearch is in progress; our goal is to
have PubSearch using the PubFetch core module for literature retrieval by
summer of 2004. We will be adapting the Rat Genome Database literature
pipeline to use the PubFetch BioMOBY services to act as its source for
literature data download.
The current version of PubFetch is available from the GMOD cvs:
http://cvs.sourceforge.net/viewcvs.py/gmod/pubfetch/
Implementation of PubSearch at RGD
Following a curator review of existing PubSearch functionality, a variety
of new features were requested by the RGD curators to enable a more
'article-centric' view of the PubSearch database. This has been
implemented by the TAIR group and plans are underway to install this
latest version of PubSearch at RGD, populate with RGD/Rat data and test
in the RGD curation process.
PubTrack
PubTrack is a monitoring tool that tracks objects as they move through
a process or workflow. Existing workflow tools move data through a
specified process, passing datasets to applications and retrieving
results and passing them to the next step in the flow. PubTrack does
not aim to direct or control workflow, and it does not track the dataset
as a whole; instead, it works at a higher resolution and tracks the data objects
within the dataset, enabling users to follow a particular object as it
moves through a process.
Progress to date:
* Review of existing workflow tools and schemas has been completed.
* The initial PubTrack schema has been developed and implemented in PostgreSQL
* Initialization scripts have been written to populate the PubTrack
database with initial object and process data. Perl scripts are used to
parse and load initialization data in a standard XML format; a DTD is
available and is used to confirm the data formatting.
* An API is under development to allow 3rd party applications to
communicate with PubTrack to initialize and update the tracking
information for objects under observation. This is being developed and
tested using data from a proteomics MS/MS analysis pipeline that is
being built in my lab.
* A basic web user interface is in development to provide end-users with
the ability to view objects and their progress through their designated
processes.
* The concept of 'estimated time of completion' has been added to allow
long term planning and project tracking. For example, the entire process
of curating an article might typically take 3 days, so the estimated time
of completion would be 3 days after the start of curation. This estimate
can be displayed on a Gantt chart and updated as individual steps in the
process are completed, allowing an increasingly refined view of the
completion date. This is being used in our proteomics tracking: component
1 generates tissue samples from animals in a process that takes up to 3
weeks to complete. Tracking the progress and updating the completion
time estimate in PubTrack allows lab members in component 2 to plan
ahead. They can see which samples will be ready and on what date, and
this is updated as the process progresses.
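The 'estimated time of completion' idea from the list above can be sketched
in a few lines. The Python example below is an illustration only and does not
use the PubTrack schema or API. It assumes that each step in a process has a
typical duration, and that the estimate is the finish time of the last
completed step (or the start time, if nothing is done yet) plus the typical
durations of the steps that remain. The step names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Step = Tuple[str, timedelta]  # (step name, typical duration)


def estimated_completion(
    start: datetime, steps: Sequence[Step], finished: List[datetime]
) -> datetime:
    """Estimate when the whole process ends.

    finished holds the actual finish times of the first len(finished) steps.
    """
    if len(finished) > len(steps):
        raise ValueError("more finished steps than steps in the process")
    anchor = finished[-1] if finished else start
    remaining = steps[len(finished):]
    return anchor + sum((duration for _, duration in remaining), timedelta())


# Hypothetical three-step curation process taking about 3 days in total.
CURATION: List[Step] = [
    ("download", timedelta(days=1)),
    ("screen", timedelta(days=1)),
    ("curate", timedelta(days=1)),
]
```

Calling the function again each time a step finishes gives the increasingly
refined completion date that a Gantt chart would display.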
Current Work
When the API is stabilized we will deploy PubTrack in the existing RGD
literature curation pipeline and ultimately in combination with PubSearch
at RGD. This will create an entire system allowing tracking of literature
across a heterogeneous system as it is downloaded from PubMed, into
PubSearch, screened, moved to RGD's Oracle db, curated and ultimately
filed. A more comprehensive user interface will be developed based on the
experiences from the proteomics pipeline and the RGD curation pipeline.
The goal is to provide generic tracking views and a way to allow specific
users to customize the displays, charts and reports if needed.
PubTrack documents including schema, loading scripts, etc. can be found on
the GMOD CVS.
http://cvs.sourceforge.net/viewcvs.py/gmod/pubtrack/
PubSearch update
We've migrated our database schema over to one that should be more
compatible with a Chado schema --- all of our table names are now prefixed
with a 'pub_' prefix, and we've done some column renaming so that we use
consistent names throughout the system.
Our production server has also been upgraded from MySQL3 to MySQL4, and
we've rewritten some parts of Pubsearch to take advantage of the
transaction support that the new MySQL provides. We've also added
referential integrity constraints to the foreign keys in our tables.
We've adopted another tool called JCoverage to help us identify areas of
our code that are not being touched by our unit tests, and have started to
tighten up our test cases so that our major classes are being exercised.
We've worked toward removing dependencies on external resources. Hit
generation now works directly from the Java codebase, rather than from an
external Python script. We've continued work on a keyword term browser to
replace the highly munged version of AmiGO that we are running locally.
GBrowse Project
Coordinator: Lincoln Stein
Major Developers: Scott Cain
Aaron Mackey
Toshiaki Katayama
Vsevolod Ilyushchenko
Marc Logghe
Sheldon McKay
Mark Wilkinson
DESCRIPTION:
GBrowse is a web-based browser for genome annotations. It is intended to
complement Apollo by providing a search, browse and drill-down display for
sequence-based features without the need for prior software installation.
GBrowse uses a database adaptor system to connect to a single primary data
source, and a temporary flat-file system to layer an arbitrary number of
third-party annotations on top of the primary data. A plugin system is used
to add new functionality to gbrowse, such as more advanced searches, and
dynamically-computed features such as ab initio gene predictions. An
internationalization layer allows GBrowse to display button labels, menus and
help text in a variety of common world languages.
The following gbrowse database adaptors currently exist:
Bio::DB::GFF (oracle, postgresql & mysql)
Well-tested and in production.
Bio::DB::Das::Chado (postgresql)
Well-tested and in early production.
GenBank proxy
Well-tested and in production. Does not handle
full-genbank keyword searches properly.
Bio::DB::Das::BioSQL
Adaptor for the BioSQL schema. In beta test.
Bio::Das
Adaptor for DAS sources. Released, but probably best
considered in beta test.
GBrowse has been downloaded from SourceForge 1,830 times, but this is
a poor way to count the number of GBrowse users. A more conservative
estimate of users comes from tallying bug reports, which ensures that
the user has at least tried to install the software. However, it
represents an undercount. In any case, we can confirm that at least
100 laboratories have installed GBrowse. As the list attached to the
bottom of this report shows, GBrowse can be found in academic,
governmental and commercial organizations in North America, South
America, Europe, Asia, Africa and Australia.
RECENT PROGRESS:
Since the last status report, we have added the following features to
GBrowse:
1) SVG output
Users can now click on a link labeled "Publication Quality Image" and
download a Scalable Vector Graphics version of the current view. SVG
is an editable format that can be manipulated with popular graphics
programs such as Adobe Illustrator, and can be reprinted by journals
without the pixelation that plagues bitmapped images.
2) Security
Tracks can now be protected by username & password, restricted to
certain hosts, or limited to hosts presenting certain classes of RSA
(digital) certificates. A restricted track does not appear on the
screen of unauthorized users, allowing system administrators to
present a mix of proprietary and public data.
3) DAS support
GBrowse can now run on top of distributed annotation system sources.
DAS is supported in three ways:
a) As an external annotation source
Users can layer remote DAS tracks on top of the current view.
The remote DAS tracks will remain active from session to
session. The GBrowse administrator can preconfigure a set
of "recommended" DAS sources, which will then appear in a
user-selectable menu.
b) As a primary database
GBrowse can now be configured to use a local or remote DAS
database as its primary data source. This means that one
can point GBrowse at the UCSC or ENSEMBL databases and
immediately begin browsing them using the GBrowse user
interface.
c) As a DAS source
GBrowse will act as a DAS server. At the administrator's
discretion, all or selected tracks can be made exportable
via DAS, allowing sequence features to be shared between
GBrowse instances or between GBrowse and other DAS clients.
4) Feature filtering and highlighting
A new filtering and highlighting API allows plugins to hide features
based on a set of user-supplied criteria or to highlight them in
various colors.
5) New adaptors
In addition to the DAS adaptor, we have added an experimental BioSQL
adaptor to GBrowse. BioSQL is a flexible database schema designed by
the BioPerl & BioJava projects for the purposes of holding
GenBank/EMBL records in a relational format.
6) Support for GFF3 loading & dumping
GBrowse now can load and dump sequence annotations in GFF3 format
(http://song.sourceforge.net), a preliminary specification that
improves on the current GFF sequence feature format. The advantage of
this format is that it uses the Sequence Ontology, a controlled
vocabulary of sequence feature types.
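To make the GFF3 format concrete, the sketch below parses one feature line in
Python. It follows the nine tab-separated columns of GFF3 (seqid, source,
type, start, end, score, strand, phase, attributes), with the type column
holding a Sequence Ontology term and the attributes column holding
semicolon-separated tag=value pairs. It is not the GBrowse loader, which is
written in Perl; the function name and the sample line in the usage note are
illustrative.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Dict:
    """Parse a single GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    seqid, source, type_, start, end, score, strand, phase, attrs = cols

    attributes: Dict[str, List[str]] = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            # Values may be comma-separated lists and are URL-escaped.
            attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "source": source,
        "type": type_,  # Sequence Ontology term, e.g. mRNA
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }
```

A line such as "ctg123 . mRNA 1300 9000 . + . ID=mrna0001;Parent=gene00001"
(with tabs between the columns) yields type "mRNA" and a Parent attribute
that ties the transcript to its gene.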
7) Integrated MOBY support
The BioMOBY system (www.biomoby.org) is a web services system that
allows users to quickly locate and invoke bioinformatics services.
GBrowse now has an interface which allows it to find services that
will operate on selected sequence features. For example, GBrowse can
present users with a list of current services that will operate on
Drosophila gene names.
8) Support for writeback
A writeback layer has been added to GBrowse to allow external editors
to update the underlying database. This has been tested successfully
with the Artemis editor in the context of a USDA pathogens database
project. Testing with Apollo is still underway. Currently it is
recommended to edit sequence databases via the shared Chado schema and
the Apollo->Chado->GBrowse route, rather than to use Apollo->GBrowse
directly.
9) New glyphs
We have recently added a number of new glyphs for use with the
International HapMap Project. New glyphs include a "weighted allele"
glyph that indicates the major and minor alleles of a single
nucleotide polymorphism, and a set of glyphs for visualizing haplotype
blocks.
10) Bug fixes
Performance has been improved when uploading large third-party annotation
files. Nucleotide-level alignments have been fixed when the display
is "flipped." The feature name search methods have been cleaned up to
provide more consistent behavior.
PLANS FOR THE FUTURE:
Performance is a concern when viewing large numbers of uploaded
third-party features. We plan to fix this by implementing an indexed
flat file cache for uploaded features.
The user interface needs to be improved in some respects. One useful
idea is to place an icon to the left of each track to indicate whether
it is in an expanded or collapsed state.
The ability to use a different DAS source for each track, which is a
feature of ISB GBrowse, will be ported over.
As always, we are looking for volunteers fluent in non-English
languages to create and update the internationalization files.
Contact: Lincoln Stein <lstein@cshl.org>
APPENDIX. Confirmed users of GBrowse:
Agricultural Biotechnology Center, Hungary
BAWI, S. Korea
Baylor College of Medicine
Biocrates GmbH, Innsbruck
Brandeis University
Bristol-Myers Squibb
British Columbia Centre for Disease Control
CIRAD, France
CSIRO, Australia
Cambridge University (multiple labs)
Center for Genomics & Bioinformatics, Stockholm
Centre de Genetique Moleculaire, CNRS
Cold Spring Harbor Laboratory (multiple labs)
Compugen
Concordia University, Canada
Cornell Medical School
Cornell University
DNA Landmarks, Inc.
Donald Danforth Plant Sciences Center
Duke University (multiple labs)
EMBL, Heidelberg
EuGenes (hacked copy)
Faculdade de Medicina de Ribeirão Preto, São Paulo
FlyBase
Foundation for Research and Technology, Crete
Fundação Hemocentro, São Paulo
Genoscope, France
GrainGenes
Harvard University
Hospital for Sick Kids, Toronto
Illinois Institute of Technology
Incyte Corporation
Inpharmatica, Ltd.
Institute for Systems Biology, Seattle
Institute of Molecular and Cell Biology, Singapore
International Rice Research Institute, Philippines
John Innes Centre
KEGG
Kansas State University
Karolinska Institute
Kennedy Krieger Institute
Lawrence Berkeley Laboratories
Marine Biological Laboratories, Woods Hole
Massachusetts Institute of Technology (multiple labs)
Mayo Institute
McGill University
Meat Animal Research Center, University of Nebraska
Medical University of South Carolina
Michigan State University
NHGRI, NIH
National Cancer Institute, Frederick Cancer Center
New York University (multiple labs)
North Carolina State University
Northern Illinois University
Northwestern University
Oklahoma State University
Open Informatics Consulting Corp.
Oxagen Corp.
Pasteur Institute, Paris
Pioneer Corporation
QIAGEN Operon Corp.
RIKEN (multiple labs)
RatDB
Regulome, Inc.
Rhobio (Bayer CropScience SA & Biogemma joint venture)
Rigshospitalet, Copenhagen
Rockefeller University
Roslin Institute, Edinburgh
Russian Academy Medical Sciences
Serono International Corp, Geneva
Simon Fraser University
South Africa National Bioinformatics Institute
Southern Illinois University
St. Jude Children's Research Hospital, Memphis
Stowers Institute for Medical Research
Texas A&M (multiple labs)
The Institute for Genomic Research
Tulane University
University of California Davis
University of Arizona (multiple labs)
University of British Columbia
University of California Santa Barbara
University of Georgia (multiple labs)
University of Minnesota
University of Muenster
University of New South Wales, Australia
University of Oklahoma (multiple labs)
University of Pennsylvania (multiple labs)
University of Southern California
University of Texas
University of Toronto
University of Virginia
University of Washington
Universität Giessen
Université de Liège, Belgium
Wageningen Universiteit & Researchcentrum, Netherlands
Washington University at St. Louis (multiple labs)
WormBase
deVGen, Belgium
CMAP
Main developer: Ken Clark
Recent improvements include:
* Now CGI-based (no more mod_perl dependencies), making installation
much easier (and much more like GBrowse)
* Added SVG output
* Added multiple aliases for features
* Added support for arbitrary attributes for db objects
* New cross-reference scheme allows for unlimited xrefs on most db objects
* Experimental XML export/import of data added
* User tutorial added
* Faster, fewer bugs, etc.
CMAP is known to be in use by:
Barry Marler (Andy Paterson), Alex Feltus, Pratt: UGA
Rex Nelson, Chet Langin, Xiaokang Pan: Iowa State
Michelle Bobo: Oregon Health & Science University
Victor Ulat, Richard Bruskiewich: IRRI
Matthew Hobbs: University of Sydney (Australia)
Pathway Tools Status Report
Peter Karp
February 5, 2004
Please note that the full history of updates to Pathway Tools can be
found at URL
http://bioinformatics.ai.sri.com/ptools/release-notes.html
Significant updates funded under this grant since the last report are
as follows.
o We have implemented the proposed Napster-like peer-to-peer sharing
of Pathway/Genome Databases via a central network registry server.
Pathway Tools users will be able to use the software to register new
PGDBs that they create to this central registry server at SRI, and
they will be able to use the software to browse the registry and
to retrieve and install PGDBs listed there for local analysis.
o Pathway Tools has been extended to support annotation of protein
domains, sites, and chemical modifications. We have created an
ontology of domain, sites, and modification types. The Pathway/Genome
Editor tools have been extended to allow users to interactively
annotate these features on protein sequences, and the Pathway/Genome
Navigator has been extended to display these annotated features.
o We have added a batch-processing mode to the portion of Pathway Tools
that creates new Pathway/Genome Databases to allow large-scale automated
processing of multiple genomes without manual intervention. We have
undertaken a collaboration with the European Bioinformatics Institute,
who are interested in applying Pathway Tools to generate Pathway/Genome
Databases for a large number of genomes.
o We have integrated an algorithm for pathway hole filling into
Pathway Tools. A pathway hole is a reaction step in a metabolic
pathway for which no enzyme has been identified in the genome of
an organism. The pathway hole filler uses a combination of techniques
to predict which genes in the genome code for these missing enzymes.
[This algorithm developed under separate funding.]
o We have completely re-designed the menus of the desktop version
of Pathway/Genome Navigator to be more consistent with other
graphical interfaces, more intuitive to the user, and to provide
more screen area for the display of visualizations.
o We have integrated an SBML (Systems Biology Markup Language) output
tool written in the Church lab at Harvard into Pathway Tools, allowing
the reaction network within a Pathway/Genome Database to be exported
to SBML format, from which it can be imported into a number of
simulation and analysis software packages.
o We have reworked the display of information about protein complexes
within Pathway Tools to increase the clarity of this information.
o The preceding capabilities will be present in the February release
of Pathway Tools.
o We have received many emails from users reporting bugs, and asking for
information.
o 80 groups have licensed Pathway Tools to date.
o Pathway/Genome Databases available through the web include:
o Saccharomyces cerevisiae, Stanford University
http://pathway.yeastgenome.org/biocyc/
o Plasmodium falciparum, Stanford University
plasmocyc.stanford.edu
o Mycobacterium tuberculosis, Stanford University
BioCyc.org
o Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
Arabidopsis.org:1555
o Methanococcus jannaschii, EBI
Maine.ebi.ac.uk:1555 (availability intermittent)
Pathway Tools Status Report
Peter Karp
April 20, 2004
Please note that the full history of updates to Pathway Tools can be
found at URL
http://bioinformatics.ai.sri.com/ptools/release-notes.html
Significant updates funded under this grant since the last report in
February 2004 are as follows.
o Version 8.0 of Pathway Tools was released on March 12, 2004.
SRI continues to hold to our planned schedule of two releases of
Pathway Tools per year.
o 275 groups have licensed Pathway Tools to date. The large jump
in this number since the last report reflects the fact that these
numbers also include groups who use Pathway Tools to query
existing Pathway/Genome Databases (not reported earlier), in addition
to groups who use it to create new databases.
o We have made very significant progress on development of an
algorithm to automatically lay out the one-page metabolic overview
diagram that shows the full metabolic network of an organism -- the
algorithm is now working. We are also in the process of adding new
components of the cellular machinery to this diagram.
o SRI has hosted two 4-day training sessions for Pathway Tools.
The dates and 26 attendees are listed below. Most attendees have
brought genomes with them to the training sessions, and have left
with draft Pathway/Genome Databases.
Tutorial on March 15-18, 2004
1. John Burke Biotique Inc.
2. Guillaume Meurice Pasteur Institute
3. David Simon Pasteur Institute
4. Gregory P. Fournier MIT
5. Alex Picone Biatech
6. John Bashkin SRI
7. Tit Yee Wong University of Memphis
8. Ken Kaufman UC Berkeley
9. Jeremy Glasner University of Wisconsin
10. Lisa Herron-Olson University of Minnesota
11. Devaki Bhaya Carnegie Institution
Tutorial on April 19-22, 2004
1 Dr. Matthew Berriman The Wellcome Trust Sanger Institute
T. brucei & L. major
2 Herbert Chiang Washington University
Bacteroides thetaiotaomicron
3 Clinton Fernandez University of British Columbia
Rhodococcus sp. RHA1 (~10MB)
4 Lisa Koski University of Montreal, Canada
5 Rebecca Krupp UCLA
Methanosarcina acetivorans
6 Joanne Luciano BioPathways Consortium
Prochlorococcus marinus MED4
7 Jasintha Maniraja Universite Libre de Bruxelles
Mus musculus
8 Linyong Mao Pacific Northwest National Laboratory
Shewanella oneidensis
9 Michael P. McLeod University of British Columbia
Rhodococcus sp. RHA1 (~10MB)
10 Dylan Morris CalTech
Mycoplasma genitalium
11 Gavin Murphy CalTech
Bdellovibrio
12 Joo-Heon Park University of Texas-Houston Medical School
Treponema pallidum
13 Liviu Popescu Cornell University, Computer Science
Saccharomyces cerevisiae
14 Christopher Reigstad Washington University
unpublished uropathogenic E. coli
15 Haluk Resat Pacific Northwest National Laboratory
16 Jian Song Los Alamos National Laboratory
Pseudomonas aeruginosa
GMOD Project Status April 2004 D. Gilbert (gilbertd@indiana.edu)
Project members: Don Gilbert, Josh Goodman, Paul Poole,
Vasanth Singan (student), at Indiana University.
Projects in development for GMOD:
(1) LuceGene, document/object search/retrieval for genome data
www.gmod.org/lucegene/ eugenes.org:8081/gmod/lucegene/
version 1.2 (alpha), released for public use April 2004.
In use at FlyBase.net, euGenes.org, wFleaBase. LuceGene is similar in
concept to the bioinformatic databank access tool SRS, and web search
systems such as Google. Based on Lucene, this Java program is fast and
flexible at search and retrieval of complex data objects. It
outperforms the Chado Postgres database by 10x or more at gene object
retrieval.
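To illustrate the general idea behind index-based retrieval, here is a
minimal Python sketch of an inverted index, the data structure that
underlies Lucene. This is not LuceGene code (LuceGene is a Java
program); the class name TinyIndex, the record fields, and the sample
records are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index: maps each term to the set of record ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of records containing it
        self.records = {}                 # id -> original record

    def add(self, record_id, fields):
        """Store a record (dict of field name -> text) and index every token in it."""
        self.records[record_id] = fields
        for text in fields.values():
            for term in TOKEN.findall(text.lower()):
                self.postings[term].add(record_id)

    def search(self, query):
        """Return sorted ids of records that contain all query terms (AND search)."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


# Hypothetical records, made up for this example.
index = TinyIndex()
index.add("rec1", {"symbol": "abc1", "note": "example kinase on chromosome 2"})
index.add("rec2", {"symbol": "xyz9", "note": "example transporter on chromosome 2"})
index.add("rec3", {"symbol": "abc2", "note": "example kinase on chromosome 3"})

print(index.search("kinase chromosome"))  # ids of records containing both terms
```

A lookup touches only the posting sets for the query terms instead of
scanning every record, which is the general reason index-based
retrieval of whole objects can be fast.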
(2) Genome Directory System, data mining access to genome data
www.gmod.org/gds/
In development, web services for SOAP access to genome data and bio
sequence databanks. Plan to provide production data mining services
through this including FlyBase, euGenes genomes and Bio-Mirror/IUBio
biosequence databanks. Will add to ARGOS package for genome databases.
Includes plan to test FlyBase data analyses over TeraGrid, Fall 2004.
(3) ARGOS, a replicable genome information system
www.gmod.org/argos/ flybase.net/argos/ eugenes.org/argos/
Version 0.7 (alpha, March 2004).
ARGOS is used now for replicating public web-genome databases. Contains
all of FlyBase, euGenes, wFleaBase, and some other services.
Contents include 10 GB of multi-genome data (euGenes), 8 GB of Drosophila
data (FlyBase), and 500 MB of common software (servers, binaries).
Miscellany:
gmod/schema/XMLTools/ChadoSax/ reader for chado.xml; provides
FlyBase annotation data access.
gmod/schema/GMODTools/ Perl modules using GMOD 0.001 release for
managing miscellaneous sequences (EST, GSS, etc.) in a Chado database.
Used now in the Daphnia / wFleaBase genome database (eugenes.org/daphnia).
Apollo data search/retrieval system used at
flybase.net/apollo/
a web CGI using Chado Postgres + LuceGene
for retrieving GAME XML annotations by
lookup of gene name, genome location, or other attributes.
Tested, aided development, and used GMOD release 0.001, Postgres Chado,
XORT, Chado::DBI, GBrowse, etc. tools for FlyBase and wFleaBase, where
they now form the basis of data management.
GMOD Update from the Saccharomyces Genome Database (SGD)
Before the last GMOD meeting at Berkeley, SGD released several GMOD
software packages (Blast Graphic Viewer, Restriction Graphic Viewer and
GO Graphic Viewer). Since then, we have been working on incorporating
existing GMOD products into new tools and resources at SGD. Here is a
list of projects that are currently under development or already in
production.
1. New Fungal BLAST using BLAST Graphic Viewer.
SGD has created a new Fungal BLAST interface using the BLAST Graphic
Viewer. This new tool can be used to do BLASTN or TBLASTN searches using
any sequence of choice against any combination of fungal sequence datasets,
including genome sequences of fungal model organisms and pathogens, ESTs,
and other fungal sequence sets in GenBank. The fungal BLAST search at SGD
can be accessed from this URL.
http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast-fungal.pl
2. GBrowse at SGD
GBrowse has been set up at SGD. SGD is still testing the software
before making a general announcement about the availability of the
software. This software is running on top of a MySQL database whose
tables are populated from a flat file in GFF3 format (refer to the third
topic for detail). GBrowse at SGD can be accessed from this URL.
http://www.yeastgenome.org/cgi-bin/SGD/gbrowse/gbrowse/yeast
3. GFF3 file format
SGD has started to provide the sequence features of S. cerevisiae
genome in a flat file, which is fully compatible with GFF3 format.
This file is used as the data input to load the MySQL database for
GBrowse and the PostgreSQL database running Chado schema for SGD Lite
at Princeton. This file is updated every week on SGD's ftp site. This
file is available for download from this URL.
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGDGFF3.gff
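As a rough illustration of what consuming such a file involves, the
Python sketch below parses one GFF3 feature line into its nine
tab-separated columns and splits the attribute column into tag=value
pairs. The function name parse_gff3_line and the sample line are
hypothetical; they are not taken from SGD's file or from any GMOD
loader.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # Column 9 holds semicolon-separated tag=value pairs; a value may be
    # a comma-separated list, and reserved characters are URL-escaped.
    attributes = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": unquote(seqid),
        "source": source,
        "type": ftype,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


# Hypothetical feature line, made up for this example.
example = "chrI\texample\tgene\t1000\t2000\t.\t+\t.\tID=GENE0001;Name=GENE0001;Note=hypothetical%20example"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```

Keeping every attribute value as a list avoids special-casing tags
that may carry multiple values.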
4. SGD Lite and CHADO
The SGD colony at Princeton has been working on installing GMOD
releases 0.001 and 0.002. The Chado schemas from both releases
have been successfully installed and loaded (via a
modified GFF3 file) on a desktop running Mac OS X 10.3.2 using the
included installation scripts. We are currently working on installing
0.002, including GBrowse, on an Apple X server running 10.3.2. We plan
to assemble installation notes/documentation and distribute them during
the meeting.
5. Textpresso Beta testing
SGD has a wealth of literature information. We want to provide
expanded text searching to our users, since we have an abstract and/or
full text for most of our references. Textpresso is an information
retrieval system developed by WormBase at Caltech. Eimear Kenny spent
two weeks at SGD to help set up a test version of Textpresso. The SGD
Textpresso can be accessed from this URL.
http://www.yeastgenome.org/textpresso/
Currently, we are working on improving Textpresso's software
performance, as well as developing a yeast version of the Textpresso
ontology. We improved the performance of the markup script (text2xml.pl)
by 50%. We are also considering a few options to improve the indexing
mechanism. With regard to the ontology, we have modified the 'Gene'
and 'Localization in Time and Space' categories. We are also currently
working on a few other categories, such as Allele, Transgene and
Phenotype, in order to best reflect the biology in S. cerevisiae.