Difference between revisions of "April 2004 GMOD Meeting"
+ | Generic Model Organism Database Construction Set | ||
+ | |||
+ | ==Meeting 4== | ||
+ | |||
+ | GMOD Meeting April, 2004 | ||
+ | |||
+ | ==Presentations== | ||
+ | |||
+ | * [[Media:Cain_040526.ppt|Cain_040526.ppt]] | ||
+ | * [[Media:Crosby_040526.ppt|Crosby_040526.ppt]] | ||
+ | * [[Media:Emmert_040526.ppt|Emmert_040526.ppt]] | ||
+ | * [[Media:Gelbart_040528.ppt|Gelbart_040528.ppt]], Orthology | ||
+ | * [[Media:Gilbert_040526.ppt|Gilbert_040526.ppt]] | ||
+ | * [[Media:Harris_040527.ppt|Harris_040527.ppt]] | ||
+ | * [[Media:Kasprzyk_040526.ppt|Kasprzyk_040526.ppt]] | ||
+ | * [[Media:Kenny_040526.ppt|Kenny_040526.ppt]] | ||
+ | * [[Media:Kodira_040526.ppt|Kodira_040526.ppt]] | ||
+ | * [[Media:Matthews_040526.ppt|Matthews_040526.ppt]] | ||
+ | * [[Media:Sabo_040526.ppt|Sabo_040526.ppt]] | ||
+ | * [[Media:Schlueter_040526.ppt|Schlueter_040526.ppt]] | ||
+ | * [[Media:Terry_040526.ppt|Terry_040526.ppt]] | ||
+ | * [[Media:Worley_040526.ppt|Worley_040526.ppt]] | ||
+ | |||
+ | ==Agenda== | ||
+ | |||
+ | ===April 26, Morning: Combined Developers and Curators section=== | ||
+ | |||
+ | Mount Vernon Room | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">9:00am</td> | ||
+ | <td width="85%">Introductions<br> | ||
+ | Scott Cain (CSHL)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>9:20</td> | ||
+ | <td><br> | ||
+ | Don Gilbert (FlyBase, Indiana University)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>10:30</td> | ||
+ | <td>Break</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>10:45</td> | ||
+ | <td><br> | ||
+ | Frank Smutniak (FlyBase, Harvard University)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>11:20</td> | ||
+ | <td><br> | ||
+ | Stan Letovsky (FlyBase, Harvard University)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>11:45</td> | ||
+ | <td>Lunch (on your own; there are many good restaurants nearby, so check with a local)</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | ===April 26, Afternoon: Developer section=== | ||
+ | |||
+ | Mount Vernon Room | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">1:30</td> | ||
+ | <td width="85%"><br> | ||
+ | Arek Kasprzyk (EBI)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2:00</td> | ||
+ | <td>GMOD/Turnkey web demo<br> | ||
+ | Brian O'Connor (UCLA)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2:30</td> | ||
+ | <td>Break</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3:00</td> | ||
+ | <td><br> | ||
+ | Eimear Kenny (WormBase, CalTech)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3:30</td> | ||
+ | <td><br> | ||
+ | Toshiaki Katayama (Human Genome Center, University of Tokyo, Japan)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3:45</td> | ||
+ | <td>Break</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>4:00</td> | ||
+ | <td><br> | ||
+ | David Emmert (FlyBase, Harvard University)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>4:30</td> | ||
+ | <td><br> | ||
+ | Scott Cain</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | ===April 26, Afternoon: Curator section=== | ||
+ | |||
+ | Terrace Room | ||
+ | |||
+ | 1:30 Jennifer Wortman (The Institute for Genomic Research)<br> | ||
+ | 1:50 Shannon Schlueter (Arabidopsis thaliana Plant Genome Database, Iowa State University)<br> | ||
+ | 2:10 Aniko Sabo (Genome Sequencing Center, Washington University School of Medicine)<br> | ||
+ | 2:30 Break<br> | ||
+ | 2:50 Madeline Crosby (FlyBase, Harvard University)<br> | ||
+ | 3:10 Kim Worley (Human Genome Sequencing Center, Baylor College of Medicine)<br> | ||
+ | 3:30 Astrid Terry (Joint Genome Institute)<br> | ||
+ | 3:50 Break<br> | ||
+ | 4:10 Chinnappa Kodira (Broad Institute)<br> | ||
+ | 4:30 Michele Clamp (Broad Institute)<br> | ||
+ | 4:50 Break<br> | ||
+ | 5:00 Group discussion<br> | ||
+ | 6:00 Dinner (on your own; see above)<br> | ||
+ | |||
+ | ===April 27, Developer section=== | ||
+ | |||
+ | Mount Vernon Room | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">9:00</td> | ||
+ | <td width="85%">GMOD Alpha release, Part II</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | <p>The goal here is to install the gmod alpha release on attendees' computers, to | ||
+ | test the installation procedure and work through any issues with the release. There | ||
+ | will almost certainly also be time for "breakout" sessions | ||
+ | for smaller groups to discuss a variety of topics. Suggestions will be | ||
+ | accepted both before and during the meeting.</p> | ||
+ | |||
+ | ===April 27, Curator section=== | ||
+ | |||
+ | Terrace Room | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">9:00</td> | ||
+ | <td width="85%">Apollo Demo<br> | ||
+ | Sima Misra (FlyBase & Berkeley Drosophila Genome Center)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>9:40</td> | ||
+ | <td><br> | ||
+ | Nomi Harris (FlyBase & Berkeley Drosophila Genome Center)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>10:00</td> | ||
+ | <td>Hands-on Apollo workshop for curators<br> | ||
+ | Breakout session for Apollo developers</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>12:00</td> | ||
+ | <td>Lunch</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>1:30</td> | ||
+ | <td>Q&A session with curators & Apollo developers</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2:30</td> | ||
+ | <td>Hands-on Apollo workshop for curators<br> | ||
+ | Breakout session for Apollo developers</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>4:30</td> | ||
+ | <td>Q&A session with curators & Apollo developers<br> | ||
+ | Group discussion</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | ===April 27, Dinner=== | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">6:00</td> | ||
+ | <td width="85%"><br> | ||
+ | Reservation on us, food paid for by you</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | ===April 28, Combined Developer and Curator section=== | ||
+ | |||
+ | Mount Vernon Room | ||
+ | |||
+ | <table cellpadding="6"> | ||
+ | <tr> | ||
+ | <td width="15%">9:00</td> | ||
+ | <td>Updates from the previous day<br> | ||
+ | Scott Cain and Sima Misra</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>10:00</td> | ||
+ | <td>Break</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>10:15</td> | ||
+ | <td><br> | ||
+ | Bill Gelbart (FlyBase, Harvard University)</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>11:15</td> | ||
+ | <td>Break</td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>11:30</td> | ||
+ | <td>Closing remarks, planning for the next meeting<br> | ||
+ | Scott Cain</td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | ==Progress reports== | ||
+ | <pre> | ||
+ | GMOD Project Progress Reports | ||
+ | April, 2004 | ||
+ | ----------------------------- | ||
+ | |||
+ | The past four months have seen the first two releases of gmod, which will | ||
+ | become the suite of model organism database software. The first release, | ||
+ | version 0.001 (alpha), was released in January, 2004. The main goal of that | ||
+ | release was to establish a release procedure. The release consisted of | ||
+ | chado, the database schema | ||
+ | developed primarily by FlyBase developers at Harvard and BDGP. Additionally, | ||
+ | there were a variety of tools for installing and loading data into the | ||
+ | database which were developed primarily by Allen Day at UCLA and Scott | ||
+ | Cain at CSHL. Finally, there was a compatible version of the Generic | ||
+ | Genome Browser with a chado database adaptor developed to allow browsing | ||
+ | of genome features directly from the database. | ||
+ | |||
+ | The second release, also an alpha release, consisted of the same components, | ||
+ | and was released in March, 2004. In this release, the installation procedure | ||
+ | improved considerably, and a prerequisite that had caused testers difficulties | ||
+ | was removed. During the GMOD meeting in April, this release was installed | ||
+ | by several attendees during a workshop. Several suggestions were made that | ||
+ | will be implemented in the next release. | ||
+ | |||
+ | There are several items planned for addition or improvement in the next | ||
+ | two releases. Tools to allow importing and exporting XML formatted data | ||
+ | from chado will be included, which will allow the sequence annotation tool, | ||
+ | Apollo, to be used with chado. Additionally, a template-based web front end for | ||
+ | chado called turnkey will be included in an upcoming release. This software is | ||
+ | still early in the development process, but when it was presented to | ||
+ | developers at the GMOD meeting in April, there was considerable interest | ||
+ | in getting it included in a gmod release as soon as possible. | ||
+ | |||
+ | Longer-term goals for gmod releases include adding pubsearch and pubfetch. | ||
+ | The process of porting these applications has begun and is expected to be | ||
+ | complete by the end of the year. A tool for literature-based sequence | ||
+ | annotation, called JavaSEAN, is expected to be included in gmod in a similar | ||
+ | time frame. Additionally, there are plans from the Apollo developers to | ||
+ | create a new version of Apollo that will be able to read and write directly | ||
+ | to the database without using an XML intermediary, which will simplify the | ||
+ | process of sequence annotation considerably. | ||
+ | |||
+ | |||
+ | |||
+ | Apollo Progress Report (11/2003 - 4/2004) | ||
+ | |||
+ | Major improvements in release 1.3.6 (11/3/03): | ||
+ | |||
+ | Apollo now runs under JDK1.4, which works better on most platforms. | ||
+ | |||
+ | Can rubberband a region on the axis and the selected sequence will pop up | ||
+ | in a Sequence window. | ||
+ | |||
+ | Results that represent hits against sequences that are new to their | ||
+ | respective database (as indicated in tiers file) are shown with a box | ||
+ | around them, so that the curator can immediately see which results are | ||
+ | new and need to be looked at. | ||
+ | |||
+ | Search (Find) now allows full regexps. | ||
+ | |||
+ | Instead of having the config files in $HOME/.apollo be slightly modified | ||
+ | copies of the ones in APOLLO_ROOT/conf, you can now put ONLY the stuff | ||
+ | you want changed into your personal cfg files. Apollo will first read | ||
+ | the ones in APOLLO_ROOT/conf, and then read your personal cfgs and apply | ||
+ | any modifications. | ||
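
The layering described above can be sketched as follows. This is a minimal
illustration in Python, not Apollo's own code, and it does not use Apollo's
real config file format. The setting names and the merge_config helper are
hypothetical; the only behavior taken from the report is that the site-wide
values are read first and personal values are applied on top.

```python
def merge_config(defaults, personal):
    """Return the site-wide defaults overlaid with the user's personal settings."""
    merged = dict(defaults)   # start from a copy of the site-wide values
    merged.update(personal)   # personal values win wherever both define a key
    return merged


# Hypothetical settings standing in for APOLLO_ROOT/conf and $HOME/.apollo
site_defaults = {"utr_color": "blue", "font_size": 10, "show_tags": True}
personal_overrides = {"font_size": 14}   # ONLY the setting the user wants changed

config = merge_config(site_defaults, personal_overrides)
print(config)
```

Because the defaults are copied before the update, the site-wide settings
are left unchanged for other users.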
+ | |||
+ | Synteny (see Synteny section at end) | ||
+ | |||
+ | |||
+ | Major improvements in release 1.4.0 (internal release) (2/9/2004): | ||
+ | |||
+ | New game.tiers file format (easier to read and change). If you have an | ||
+ | old game.tiers, it will be autoconverted to the new format. | ||
+ | |||
+ | Better handling of non-gene annotation types. New glyphs for showing | ||
+ | them in main Apollo display. | ||
+ | New annotations are automatically assigned the type (e.g. gene, tRNA, | ||
+ | etc.) appropriate to the evidence that was used to create them. (Type | ||
+ | can then be changed in the annotation info editor, if desired.) | ||
+ | |||
+ | Structured transaction records are now added to the XML when you save. | ||
+ | They include the type of object that changed (e.g. TRANSCRIPT; | ||
+ | ANNOTATION; COMMENT), the operation (e.g. ADD, SPLIT, etc.), the relevant | ||
+ | names and/or IDs before and after the transaction, and the user and | ||
+ | time/date when the change was made. | ||
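
As an illustration of the information such a record carries, here is a
small Python sketch. It is not Apollo's XML format; the class name, field
names and sample values are assumptions made for this example, and only
the list of items recorded (object type, operation, names/IDs before and
after, user, time/date) comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    object_type: str     # e.g. "TRANSCRIPT", "ANNOTATION", "COMMENT"
    operation: str       # e.g. "ADD", "SPLIT"
    before: tuple        # names and/or IDs before the change
    after: tuple         # names and/or IDs after the change
    user: str            # who made the change
    timestamp: datetime  # when the change was made


# Hypothetical record: one annotation split into two
split = Transaction(
    object_type="ANNOTATION",
    operation="SPLIT",
    before=("gene-A",),
    after=("gene-A1", "gene-A2"),
    user="curator1",
    timestamp=datetime(2004, 2, 9, 10, 30),
)
print(split)
```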
+ | |||
+ | Support for translational exceptions, including frame shifts and one base | ||
+ | pair genomic sequencing errors. | ||
+ | |||
+ | UTRs are now shown in a different (configurable) color from the rest of | ||
+ | the gene. | ||
+ | |||
+ | Restriction enzyme mapper: | ||
+ | - Cut sites show up in main window (near the axis) | ||
+ | - Can now map multiple restriction enzymes at once | ||
+ | - Table of restriction fragments; can be selected for viewing in | ||
+ | Sequence window | ||
+ | |||
+ | Annotation info window: | ||
+ | - Now has integrated annotation tree | ||
+ | - Shows arbitrary properties for annotations and transcripts (including | ||
+ | validation_flag) | ||
+ | - Shows translational exceptions and genomic sequencing errors | ||
+ | - Lets you edit annotation ID as well as name/symbol | ||
+ | |||
+ | Ability to tag results by selecting from a list of comments, which are | ||
+ | specified (as ResultTags) in game.style. Tagged results are crosshatched | ||
+ | in pink in the display. | ||
+ | |||
+ | Fixed updating of peptide sequences. | ||
+ | |||
+ | |||
+ | Improvements in releases 1.4.1 (3/12/04) and 1.4.2 (3/18/04): | ||
+ | |||
+ | Red/green markers at axis show where sequence/region ends. | ||
+ | |||
+ | To help you identify splice sites that are unconventional, colored | ||
+ | triangles appear in the annotation glyph. | ||
+ | |||
+ | Can now load D. melanogaster data from r3.1 (gadfly) and r3.2 (chado) | ||
+ | (both via cgi). | ||
+ | |||
+ | |||
+ | 1.4.3 (4/19/04): | ||
+ | Users can now get the sequence of the entire segment they are looking at, not | ||
+ | just a rubberbanded section. [File -> Save sequence] | ||
+ | |||
+ | |||
+ | |||
+ | Synteny progress, 11/03-4/04: | ||
+ | |||
+ | - Synteny now works with GAME. You can load one species and then use the | ||
+ | blast or syntenic block results to another species (for now it's pseudo) | ||
+ | to load another species. The other species is loaded with the same range | ||
+ | around that feature. Links between the two species are automatically | ||
+ | derived from the blast link features that are present in both datasets | ||
+ | (no explicit link file needs to be specified). | ||
+ | |||
+ | - Database chooser was added to select the different species databases. | ||
+ | |||
+ | - Able to switch back and forth from synteny data adapter to regular data | ||
+ | adapters without restarting Apollo. | ||
+ | |||
+ | - Can save and edit (edit could use some rigorous testing) | ||
+ | |||
+ | - Can home in on a link from the link popup menu. The view zooms in and shows the | ||
+ | strands of the homed-in link; strands not in the link are hidden. | ||
+ | |||
+ | - Species now zoom and scroll together by default. Can unlock zoom with | ||
+ | shift key, and unlock scroll with menu item. | ||
+ | |||
+ | - You can now configure links between 2 curation sets that contain links to | ||
+ | each other. link_type, source and hit species are specified in the linked | ||
+ | type in the tiers file. This works with GAME and, in theory, could be made to | ||
+ | work with other adapters that have linked data embedded in the species | ||
+ | data. | ||
+ | |||
+ | |||
+ | |||
+ | Textpresso: A progress report | ||
+ | |||
+ | Eimear Kenny, Hans-Michael Mueller and Paul Sternberg | ||
+ | |||
+ | Updates made to Textpresso since September 2003: | ||
+ | |||
+ | Textpresso for Yeast Literature | ||
+ | (Toward a generic MOD information retrieval/extraction search engine) | ||
+ | |||
+ | SGD developers and curators met with Eimear Kenny for two weeks at | ||
+ | the beginning of March at Stanford to build a Textpresso search engine for | ||
+ | Yeast. During that period the Textpresso software was installed on a | ||
+ | Solaris system and three builds with a test corpus of ~400 full text | ||
+ | journal articles were completed. In addition, the Textpresso Ontology for | ||
+ | worm literature was modified to a functional preliminary ontology for | ||
+ | yeast literature. Plans to expand the corpus to 10,000 yeast papers and | ||
+ | make improvements to the yeast ontology are underway at Stanford. | ||
+ | |||
+ | Integration of Textpresso into Literature Curation Pipeline | ||
+ | |||
+ | We have integrated Textpresso into the WormBase curation pipeline | ||
+ | to expedite the extraction of genetic interaction information from the | ||
+ | literature. A prototype curation interface has been developed to enable a | ||
+ | curator to extract data from sentences returned by a Textpresso query for | ||
+ | genetic interaction. We find that these Textpresso sentences are enriched | ||
+ | 3-fold for gene-gene interactions compared to sentences that mention two | ||
+ | or more gene names and 39-fold compared to random sentences from the | ||
+ | literature. | ||
+ | |||
+ | Textpresso MOD interface | ||
+ | |||
+ | We have generated a WormBase-like interface for Textpresso to integrate | ||
+ | the Textpresso information retrieval engine into the WormBase website. | ||
+ | http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract | ||
+ | |||
+ | Textpresso Package | ||
+ | |||
+ | Hans-Michael Mueller is working on packaging Textpresso for release in | ||
+ | the first half of this year. | ||
+ | |||
+ | Textpresso paper ... under review | ||
+ | |||
+ | A Textpresso publication is currently under revision. | ||
+ | |||
+ | |||
+ | |||
+ | PubFetch/PubTrack Progress Report (April 2004) | ||
+ | |||
+ | PubFetch | ||
+ | PubFetch is a tool for accessing literature from various online resources. | ||
+ | The goal is to provide a common interface and common format to downstream | ||
+ | applications to allow them to query different literature repositories in | ||
+ | a single, unified fashion. | ||
+ | |||
+ | PubFetch has been implemented in two forms: | ||
+ | * Java servlet core + simple web interface to provide interactive access | ||
+ | to PubFetch | ||
+ | * Provides access to PubMed and Agricola databases | ||
+ | * BioMOBY wrapper around servlet core to provide webservice access to PubFetch | ||
+ | |||
+ | A variety of new features have been introduced: | ||
+ | * Duplicate filtering - running the same search on multiple data sources | ||
+ | results in some duplication of articles. The duplicate filter detects | ||
+ | these articles and returns a non-redundant set of data. Database Ids from | ||
+ | both sources are maintained in the non-redundant set. | ||
+ | * The web interface version highlights keywords in the search results to | ||
+ | aid in review of the returned articles. | ||
+ | * Connection to full text - a hyperlink to the full text is returned (if | ||
+ | available from PubMed) | ||
+ | * Filtering of 'ahead of print' articles - Abstracts are appearing in | ||
+ | PubMed and being assigned PubMed Ids prior to being published and are | ||
+ | being reassigned PubMed Ids after publication. PubFetch allows filtering | ||
+ | of these ahead of print articles to retrieve only published articles. | ||
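
The duplicate filtering feature above can be sketched as follows. This is
an illustrative Python sketch, not PubFetch's Java implementation. The
report does not say how duplicates are detected, so matching on a
normalized title plus year is an assumption, and the record layout and
the Ids shown are made-up placeholders.

```python
def normalize(title):
    """Lower-case, collapse whitespace and drop a trailing period."""
    return " ".join(title.lower().split()).rstrip(".")


def deduplicate(records):
    """Merge records that share a title/year key, keeping every source's Id."""
    merged = {}
    for rec in records:
        key = (normalize(rec["title"]), rec["year"])   # assumed matching rule
        entry = merged.setdefault(
            key, {"title": rec["title"], "year": rec["year"], "ids": {}}
        )
        entry["ids"][rec["source"]] = rec["id"]
    return list(merged.values())


# Placeholder hits from two sources; the first two are the same article
hits = [
    {"source": "PubMed", "id": "PM-0001", "title": "A study of wheat rust.", "year": 2003},
    {"source": "Agricola", "id": "AG-0042", "title": "A Study of Wheat Rust", "year": 2003},
    {"source": "PubMed", "id": "PM-0002", "title": "Rice genome notes", "year": 2004},
]
unique = deduplicate(hits)
for article in unique:
    print(article["title"], article["ids"])
```

The merged entry keeps one Id per source, which matches the statement that
database Ids from both sources are maintained in the non-redundant set.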
+ | |||
+ | The BioMOBY interface provides the following services: | ||
+ | * SearchPubmed - Search PubMed for given query and get PMIDs | ||
+ | * GetPubmed - Retrieve PubMed articles in MEDLINE display format for given | ||
+ | PMIDs | ||
+ | * FetchFull - Get FullText for given PMID | ||
+ | * fetchAgID - Search Agricola for given query and get Agricola accession number | ||
+ | * fetchAgDoc - Get Agricola document in MEDLINE like format for given | ||
+ | Agricola accession number | ||
+ | |||
+ | Current work | ||
+ | The integration of PubFetch and PubSearch is in progress; our goal is to | ||
+ | have PubSearch using the PubFetch core module for literature retrieval by | ||
+ | summer of 2004. We will be adapting the Rat Genome Database literature | ||
+ | pipeline to use the PubFetch BioMOBY services to act as its source for | ||
+ | literature data download. | ||
+ | |||
+ | The current version of PubFetch is available from the GMOD cvs: | ||
+ | http://cvs.sourceforge.net/viewcvs.py/gmod/pubfetch/ | ||
+ | |||
+ | Implementation of PubSearch at RGD | ||
+ | Following a curator review of existing PubSearch functionality, a variety | ||
+ | of new features were requested by the RGD curators to enable a more | ||
+ | 'article-centric' view of the PubSearch database. This has been | ||
+ | implemented by the TAIR group and plans are underway to install this | ||
+ | latest version of PubSearch at RGD, populate with RGD/Rat data and test | ||
+ | in the RGD curation process. | ||
+ | |||
+ | PubTrack | ||
+ | PubTrack is a monitoring tool that tracks objects as they move through | ||
+ | a process or workflow. Existing workflow tools move data through a | ||
+ | specified process, passing datasets to applications and retrieving | ||
+ | results and passing them to the next step in the flow. PubTrack does | ||
+ | not aim to direct or control workflow, and it does not track the dataset | ||
+ | as a whole; instead, it provides higher resolution by tracking the data objects | ||
+ | within the dataset, enabling users to follow a particular object as it | ||
+ | moves through a process. | ||
+ | |||
+ | Progress to date: | ||
+ | * Review of existing workflow tools and schemas has been completed. | ||
+ | * The initial PubTrack schema has been developed and implemented in PostgreSQL | ||
+ | * Initialization scripts have been written to populate the PubTrack | ||
+ | database with initial object and process data. Perl scripts are used to | ||
+ | parse and load initialization data in a standard XML format; a DTD is | ||
+ | available and is used to confirm the data formatting. | ||
+ | * An API is under development to allow 3rd party applications to | ||
+ | communicate with PubTrack to initialize and update the tracking | ||
+ | information for objects under observation. This is being developed and | ||
+ | tested using data from a proteomics MS/MS analysis pipeline that is | ||
+ | being built in my lab. | ||
+ | * A basic web user interface is in development to provide end-users with | ||
+ | the ability to view objects and their progress through their designated | ||
+ | processes. | ||
+ | * The concept of 'estimated time of completion' has been added to allow | ||
+ | long term planning and project tracking. For example, the entire process | ||
+ | of curating an article might typically take 3 days, so the estimated time | ||
+ | of completion would be 3 days after the start of curation. This estimate | ||
+ | can be displayed on a Gantt chart and updated as individual steps in the | ||
+ | process are completed, allowing an increasingly refined view of the | ||
+ | completion date. This is being used in our proteomics tracking: component | ||
+ | 1 generates tissue samples from animals in a process that takes up to 3 | ||
+ | weeks to complete. Tracking the progress and updating the completion | ||
+ | time estimate in PubTrack allows lab members in component 2 to plan | ||
+ | ahead. They are able to see what samples will be ready and on what date | ||
+ | they will be ready and this is updated as the process progresses. | ||
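
The arithmetic behind the 'estimated time of completion' can be sketched
as follows. This Python sketch is not PubTrack's schema or API. The step
names and per-step durations are hypothetical, chosen so that they add up
to the 3-day article curation example above; the estimate is simply the
current time plus the typical durations of the steps still to run.

```python
from datetime import datetime, timedelta


def estimated_completion(now, remaining_steps):
    """Add the typical durations of the steps still to run to the current time."""
    return now + sum(remaining_steps.values(), timedelta())


# Hypothetical steps whose typical durations add up to 3 days
steps = {
    "screen": timedelta(days=1),
    "curate": timedelta(days=1.5),
    "file": timedelta(days=0.5),
}

start = datetime(2004, 4, 1, 9, 0)
initial = estimated_completion(start, steps)   # 3 days after the start

# Screening finished half a day late, so re-estimate from the remaining steps
after_screen = start + timedelta(days=1.5)
remaining = {name: d for name, d in steps.items() if name != "screen"}
revised = estimated_completion(after_screen, remaining)

print(initial, revised)
```

Each time a step completes, the estimate is recomputed from the actual
time and the remaining steps, which is what gives the increasingly refined
view of the completion date.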
+ | |||
+ | Current Work | ||
+ | When the API is stabilized we will deploy PubTrack in the existing RGD | ||
+ | literature curation pipeline and ultimately in combination with PubSearch | ||
+ | at RGD. This will create an entire system allowing tracking of literature | ||
+ | across a heterogeneous system as it is downloaded from PubMed, into | ||
+ | PubSearch, screened, moved to RGD's Oracle db, curated and ultimately | ||
+ | filed. A more comprehensive user interface will be developed based on the | ||
+ | experiences from the proteomics pipeline and the RGD curation pipeline. | ||
+ | The goal is to provide generic tracking views and a way to allow specific | ||
+ | users to customize the displays, charts and reports if needed. | ||
+ | |||
+ | PubTrack documents including schema, loading scripts, etc. can be found on | ||
+ | the GMOD CVS. | ||
+ | http://cvs.sourceforge.net/viewcvs.py/gmod/pubtrack/ | ||
+ | |||
+ | |||
+ | |||
+ | PubSearch update | ||
+ | |||
+ | We've migrated our database schema over to one that should be more | ||
+ | compatible with a Chado schema --- all of our table names are now prefixed | ||
+ | with a 'pub_' prefix, and we've done some column renaming so that we use | ||
+ | consistent names throughout the system. | ||
+ | |||
+ | Our production server has also been upgraded from MySQL3 to MySQL4, and | ||
+ | we've rewritten some parts of PubSearch to take advantage of the | ||
+ | transaction support that the new MySQL provides. We've also added | ||
+ | referential integrity constraints to the foreign keys in our tables. | ||
+ | |||
+ | We've adopted another tool called JCoverage to help us identify areas of | ||
+ | our code that are not being touched by our unit tests, and have started to | ||
+ | tighten up our test cases so that our major classes are being exercised. | ||
+ | |||
+ | We've worked toward removing dependencies on external resources. Hit | ||
+ | generation now works directly from the Java codebase, rather than from an | ||
+ | external Python script. We've continued work on a keyword term browser to | ||
+ | replace the highly munged version of AmiGO that we are running locally. | ||
+ | |||
+ | |||
+ | |||
+ | GBrowse Project | ||
+ | |||
+ | Coordinator: Lincoln Stein | ||
+ | Major Developers: Scott Cain | ||
+ | Aaron Mackey | ||
+ | Toshiaki Katayama | ||
+ | Vsevolod Ilyushchenko | ||
+ | Marc Logghe | ||
+ | Sheldon McKay | ||
+ | Mark Wilkinson | ||
+ | |||
+ | DESCRIPTION: | ||
+ | |||
+ | GBrowse is a web-based browser for genome annotations. It is intended to | ||
+ | complement Apollo by providing a search, browse and drill-down display for | ||
+ | sequence-based features without the need for prior software installation. | ||
+ | GBrowse uses a database adaptor system to connect to a single primary data | ||
+ | source, and a temporary flat-file system to layer an arbitrary number of | ||
+ | third-party annotations on top of the primary data. A plugin system is used | ||
+ | to add new functionality to gbrowse, such as more advanced searches, and | ||
+ | dynamically-computed features such as ab initio gene predictions. An | ||
+ | internationalization layer allows GBrowse to display button labels, menus and | ||
+ | help text in a variety of common world languages. | ||
+ | |||
+ | The following gbrowse database adaptors currently exist: | ||
+ | |||
+ | Bio::DB::GFF (oracle, postgresql & mysql) | ||
+ | Well-tested and in production. | ||
+ | |||
+ | Bio::DB::Das::Chado (postgresql) | ||
+ | Well-tested and in early production. | ||
+ | |||
+ | GenBank proxy | ||
+ | Well-tested and in production. Does not handle | ||
+ | full-genbank keyword searches properly. | ||
+ | |||
+ | Bio::DB::Das::BioSQL | ||
+ | Adaptor for the BioSQL schema. In beta test. | ||
+ | |||
+ | Bio::Das | ||
+ | Adaptor for DAS sources. Released, but probably best | ||
+ | considered in beta test. | ||
+ | |||
+ | GBrowse has been downloaded from SourceForge 1,830 times, but this is | ||
+ | a poor way to count the number of GBrowse users. A more conservative | ||
+ | estimate of users comes from tallying bug reports, which ensures that | ||
+ | the user has at least tried to install the software. However, it | ||
+ | represents an undercount. In any case, we can confirm that at least | ||
+ | 100 laboratories have installed GBrowse. As the list attached to the | ||
+ | bottom of this report shows, GBrowse can be found in academic, | ||
+ | governmental and commercial organizations in North America, South | ||
+ | America, Europe, Asia, Africa and Australia. | ||
+ | |||
+ | RECENT PROGRESS: | ||
+ | |||
+ | Since the last status report, we have added the following features to | ||
+ | GBrowse: | ||
+ | |||
+ | 1) SVG output | ||
+ | |||
+ | Users can now click on a link labeled "Publication Quality Image" and | ||
+ | download a Scalable Vector Graphics version of the current view. SVG | ||
+ | is an editable format that can be manipulated with popular graphics | ||
+ | programs such as Adobe Illustrator, and can be reprinted by journals | ||
+ | without the pixelation that plagues bitmapped images. | ||
+ | |||
+ | 2) Security | ||
+ | |||
+ | Tracks can now be protected by username & password, restricted to | ||
+ | certain hosts, or limited to hosts presenting certain classes of RSA | ||
+ | (digital) certificates. A restricted track does not appear on the | ||
+ | screen of unauthorized users, allowing system administrators to | ||
+ | present a mix of proprietary and public data. | ||
+ | |||
+ | 3) DAS support | ||
+ | |||
+ | GBrowse can now run on top of distributed annotation system sources. | ||
+ | DAS is supported in three ways: | ||
+ | a) As an external annotation source | ||
+ | Users can layer remote DAS tracks on top of the current view. | ||
+ | The remote DAS tracks will remain active from session to | ||
+ | session. The GBrowse administrator can preconfigure a set | ||
+ | of "recommended" DAS sources, which will then appear in a | ||
+ | user-selectable menu. | ||
+ | |||
+ | b) As a primary database | ||
+ | GBrowse can now be configured to use a local or remote DAS | ||
+ | database as its primary data source. This means that one | ||
+ | can point GBrowse at the UCSC or ENSEMBL databases and | ||
+ | immediately begin browsing them using the GBrowse user | ||
+ | interface. | ||
+ | |||
+ | c) As a DAS source | ||
+ | GBrowse will act as a DAS server. At the administrator's | ||
+ | discretion, all or selected tracks can be made exportable | ||
+ | via DAS, allowing sequence features to be shared between | ||
+ | GBrowse instances or between GBrowse and other DAS clients. | ||
+ | |||
+ | 4) Feature filtering and highlighting | ||
+ | |||
+ | A new filtering and highlighting API allows plugins to hide features | ||
+ | based on a set of user-supplied criteria or to highlight them in | ||
+ | various colors. | ||
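
The idea of filtering and highlighting by user-supplied criteria can be
sketched as follows. This Python sketch is only a conceptual illustration
and is not the GBrowse plugin API; the apply_plugin function, the feature
dictionaries and the two callbacks are all hypothetical.

```python
def apply_plugin(features, keep, highlight):
    """Hide features rejected by `keep`; pair the rest with a highlight color."""
    shown = []
    for feature in features:
        if not keep(feature):
            continue                           # filtered out: not displayed
        shown.append((feature["name"], highlight(feature)))
    return shown


# Hypothetical features and user-supplied criteria
features = [
    {"name": "geneA", "type": "gene", "score": 0.9},
    {"name": "rep1", "type": "repeat", "score": 0.2},
    {"name": "geneB", "type": "gene", "score": 0.4},
]
result = apply_plugin(
    features,
    keep=lambda f: f["type"] == "gene",                           # hide non-genes
    highlight=lambda f: "yellow" if f["score"] > 0.5 else None,   # None = no highlight
)
print(result)
```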
+ | |||
+ | 5) New adaptors | ||
+ | |||
+ | In addition to the DAS adaptor, we have added an experimental BioSQL | ||
+ | adaptor to GBrowse. BioSQL is a flexible database schema designed by | ||
+ | the BioPerl & BioJava projects for the purposes of holding | ||
+ | GenBank/EMBL records in a relational format. | ||
+ | |||
+ | 6) Support for GFF3 loading & dumping | ||
+ | |||
+ | GBrowse now can load and dump sequence annotations in GFF3 format | ||
+ | (http://song.sourceforge.net), a preliminary specification that | ||
+ | improves on the current GFF sequence feature format. The advantage of | ||
+ | this format is that it uses the Sequence Ontology, a controlled | ||
+ | vocabulary of sequence feature types. | ||
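
To show what a GFF3 record looks like, here is a minimal Python parsing
sketch. It is not the GBrowse loader. It assumes the nine tab-separated
column layout of the GFF3 specification, with the feature type (a Sequence
Ontology term such as "gene") in the third column and tag=value attributes
in the ninth; since the format was still preliminary at the time of this
report, details may have differed. The sample line is illustrative only.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Split one GFF3 feature line into named columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(COLUMNS, fields[:8]))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)   # values may be URL-escaped
    feature["attributes"] = attributes
    return feature


gene = parse_gff3_line("ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN")
print(gene["type"], gene["start"], gene["end"], gene["attributes"])
```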
+ | |||
+ | 7) Integrated MOBY support | ||
+ | |||
+ | The BioMOBY system (www.biomoby.org) is a web services system that | ||
+ | allows users to quickly locate and invoke bioinformatics services. | ||
+ | GBrowse now has an interface which allows it to find services that | ||
+ | will operate on selected sequence features. For example, GBrowse can | ||
+ | present users with a list of current services that will operate on | ||
+ | Drosophila gene names. | ||
+ | |||
+ | 8) Support for writeback | ||
+ | |||
+ | A writeback layer has been added to GBrowse to allow external editors | ||
+ | to update the underlying database. This has been tested successfully | ||
+ | with the Artemis editor in the context of a USDA pathogens database | ||
+ | project. Testing with Apollo is still underway. Currently it is | ||
+ | recommended to edit sequence databases via the shared Chado schema and | ||
+ | the Apollo->Chado->GBrowse route, rather than to use Apollo->GBrowse | ||
+ | directly. | ||
+ | |||
+ | 9) New glyphs | ||
+ | |||
+ | We have recently added a number of new glyphs for use with the | ||
+ | International HapMap Project. New glyphs include a "weighted allele" | ||
+ | glyph that indicates the major and minor alleles of a single | ||
+ | nucleotide polymorphism, and a set of glyphs for visualizing haplotype | ||
+ | blocks. | ||
+ | |||
+ | 10) Bug fixes | ||
+ | |||
+ | Performance has been improved when uploading large third-party annotation | ||
+ | files. Nucleotide-level alignments have been fixed when the display | ||
+ | is "flipped." The feature name search methods have been cleaned up to | ||
+ | provide more consistent behavior. | ||
+ | |||
+ | PLANS FOR THE FUTURE: | ||
+ | |||
+ | Performance is a concern when viewing large numbers of uploaded | ||
+ | third-party features. We plan to fix this by implementing an indexed | ||
+ | flat file cache for uploaded features. | ||
+ | |||
+ | The user interface needs to be improved in some respects. One useful | ||
+ | idea is to place an icon to the left of each track to indicate whether | ||
+ | it is in an expanded or collapsed state. | ||
+ | |||
+ | The ability to use a different DAS source for each track, which is a | ||
+ | feature of ISB GBrowse, will be ported over. | ||
+ | |||
+ | As always, we are looking for volunteers fluent in non-English | ||
+ | languages to create and update the internationalization files. | ||
+ | |||
+ | Contact: Lincoln Stein <lstein@cshl.org> | ||
+ | |||
+ | APPENDIX. Confirmed users of GBrowse: | ||
+ | |||
+ | Agricultural Biotechnology Center, Hungary | ||
+ | BAWI, S. Korea | ||
+ | Baylor College of Medicine | ||
+ | Biocrates GmbH, Innsbruck | ||
+ | Brandeis University | ||
+ | Bristol-Myers Squibb | ||
+ | British Columbia Centre for Disease Control | ||
+ | CIRAD, France | ||
+ | CSIRO, Australia | ||
+ | Cambridge University (multiple labs) | ||
+ | Center for Genomics & Bioinformatics, Stockholm | ||
+ | Centre de Genetique Moleculaire, CNRS | ||
+ | Cold Spring Harbor Laboratory (multiple labs) | ||
+ | Compugen | ||
+ | Concordia University, Canada | ||
+ | Cornell Medical School | ||
+ | Cornell University | ||
+ | DNA Landmarks, Inc. | ||
+ | Donald Danforth Plant Sciences Center | ||
+ | Duke University (multiple labs) | ||
+ | EMBL, Heidelberg | ||
+ | EuGenes (hacked copy) | ||
+ | Faculdade de Medicina de Ribeirão Preto, São Paulo | ||
+ | FlyBase | ||
+ | Foundation for Research and Technology, Crete | ||
+ | Fundação Hemocentro, São Paulo | ||
+ | Genoscope, France | ||
+ | GrainGenes | ||
+ | Harvard University | ||
+ | Hospital for Sick Kids, Toronto | ||
+ | Illinois Institute of Technology | ||
+ | Incyte Corporation | ||
+ | Inpharmatica, Ltd. | ||
+ | Institute for Systems Biology, Seattle | ||
+ | Institute of Molecular and Cell Biology, Singapore | ||
+ | International Rice Research Institute, Philippines | ||
+ | John Innes Centre | ||
+ | KEGG | ||
+ | Kansas State University | ||
+ | Karolinska Institute | ||
+ | Kennedy Krieger Institute | ||
+ | Lawrence Berkeley Laboratories | ||
+ | Marine Biological Laboratories, Woods Hole | ||
+ | Massachusetts Institute of Technology (multiple labs) | ||
+ | Mayo Institute | ||
+ | McGill University | ||
+ | Meat Animal Research Center, University of Nebraska | ||
+ | Medical University of South Carolina | ||
+ | Michigan State University | ||
+ | NHGRI, NIH | ||
+ | National Cancer Institute, Frederick Cancer Center | ||
+ | New York University (multiple labs) | ||
+ | North Carolina State University | ||
+ | Northern Illinois University | ||
+ | Northwestern University | ||
+ | Oklahoma State University | ||
+ | Open Informatics Consulting Corp. | ||
+ | Oxagen Corp. | ||
+ | Pasteur Institute, Paris | ||
+ | Pioneer Corporation | ||
+ | QIAGEN Operon Corp. | ||
+ | RIKEN (multiple labs) | ||
+ | RatDB | ||
+ | Regulome, Inc. | ||
+ | Rhobio (Bayer CropScience SA & Biogemma joint venture) | ||
+ | Rigshospitalet, Copenhagen | ||
+ | Rockefeller University | ||
+ | Roslin Institute, Edinburgh | ||
+ | Russian Academy Medical Sciences | ||
+ | Serono International Corp, Geneva | ||
+ | Simon Fraser University | ||
+ | South Africa National Bioinformatics Institute | ||
+ | Southern Illinois University | ||
+ | St. Jude Children's Research Hospital, Memphis | ||
+ | Stowers Institute for Medical Research | ||
+ | Texas A&M (multiple labs) | ||
+ | The Institute for Genomic Research | ||
+ | Tulane University | ||
+ | University California Davis | ||
+ | University of Arizona (multiple labs) | ||
+ | University of British Columbia | ||
+ | University of California Santa Barbara | ||
+ | University of Georgia (multiple labs) | ||
+ | University of Minnesota | ||
+ | University of Muenster | ||
+ | University of New South Wales, Australia | ||
+ | University of Oklahoma (multiple labs) | ||
+ | University of Pennsylvania (multiple labs) | ||
+ | University of Southern California | ||
+ | University of Texas | ||
+ | University of Toronto | ||
+ | University of Virginia | ||
+ | University of Washington | ||
+ | Universität Giessen | ||
+ | Université de Liège, Belgium | ||
+ | Wageningen Universiteit & Researchcentrum, Netherlands | ||
+ | Washington University at St. Louis (multiple labs) | ||
+ | WormBase | ||
+ | deVGen, Belgium | ||
+ | |||
+ | |||
+ | CMAP | ||
+ | Main developer: Ken Clark | ||
+ | |||
+ | Recent improvements include: | ||
+ | |||
+ | * Now CGI-based (no more mod_perl dependencies), making installation | ||
+ | much easier (and much more like GBrowse) | ||
+ | * Added SVG output | ||
+ | * Added multiple aliases for features | ||
+ | * Added support for arbitrary attributes for db objects | ||
+ | * New cross-reference scheme allows for unlimited xrefs on most db objects | ||
+ | * Experimental XML export/import of data added | ||
+ | * User tutorial added | ||
+ | * Faster, fewer bugs, etc. | ||
+ | |||
+ | CMAP is known to be in use by: | ||
+ | |||
+ | Barry Marler (Andy Paterson), Alex Feltus, Pratt: UGA | ||
+ | Rex Nelson, Chet Langin, Xiaokang Pan: Iowa State | ||
+ | Michelle Bobo: Oregon Health & Science University | ||
+ | Victor Ulat, Richard Bruskiewich: IRRI | ||
+ | Matthew Hobbs: University of Sydney (Australia) | ||
+ | |||
+ | |||
+ | |||
+ | Pathway Tools Status Report | ||
+ | Peter Karp | ||
+ | February 5, 2004 | ||
+ | |||
+ | Please note that the full history of updates to Pathway Tools can be | ||
+ | found at URL | ||
+ | http://bioinformatics.ai.sri.com/ptools/release-notes.html | ||
+ | |||
+ | Significant updates funded under this grant since the last report are | ||
+ | as follows. | ||
+ | |||
+ | o We have implemented the proposed Napster-like peer-to-peer sharing | ||
+ | of Pathway/Genome Databases via a central network registry server. | ||
+ | Pathway Tools users will be able to use the software to register new | ||
+ | PGDBs that they create to this central registry server at SRI, and | ||
+ | they will be able to use the software to browse the registry and | ||
+ | to retrieve and install PGDBs listed there for local analysis. | ||
+ | |||
+ | o Pathway Tools has been extended to support annotation of protein | ||
+ | domains, sites, and chemical modifications. We have created an | ||
+ | ontology of domain, sites, and modification types. The Pathway/Genome | ||
+ | Editor tools have been extended to allow users to interactively | ||
+ | annotate these features on protein sequences, and the Pathway/Genome | ||
+ | Navigator has been extended to display these annotated features. | ||
+ | |||
+ | o We have added a batch-processing mode to the portion of Pathway Tools | ||
+ | that creates new Pathway/Genome Databases to allow large-scale automated | ||
+ | processing of multiple genomes without manual intervention. We have | ||
+ | undertaken a collaboration with the European Bioinformatics Institute, | ||
+ | who are interested in applying Pathway Tools to generate Pathway/Genome | ||
+ | Databases for a large number of genomes. | ||
+ | |||
+ | o We have integrated an algorithm for pathway hole filling into | ||
+ | Pathway Tools. A pathway hole is a reaction step in a metabolic | ||
+ | pathway for which no enzyme has been identified in the genome of | ||
+ | an organism. The pathway hole filler uses a combination of techniques | ||
+ | to predict which genes in the genome code for these missing enzymes. | ||
+ | [This algorithm developed under separate funding.] | ||
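The definition of a pathway hole given above can be sketched in a few lines. This only illustrates how holes would be identified from the definition; it is not the hole-filling algorithm in Pathway Tools, and the function name, reaction IDs, and gene names are hypothetical.

```python
def find_pathway_holes(pathway_reactions, reaction_to_genes):
    """Return the reactions of a pathway that have no enzyme-coding gene assigned.

    pathway_reactions: ordered list of reaction IDs in the pathway.
    reaction_to_genes: dict mapping a reaction ID to the genes annotated for it.
    """
    return [rxn for rxn in pathway_reactions if not reaction_to_genes.get(rxn)]


# Made-up pathway: rxn_B has no annotation entry and rxn_C has an empty one.
pathway = ["rxn_A", "rxn_B", "rxn_C"]
annotated = {"rxn_A": ["gene1"], "rxn_C": []}
print(find_pathway_holes(pathway, annotated))  # prints ['rxn_B', 'rxn_C']
```

Predicting which genes fill each hole is the hard part described in the report and is deliberately left out here.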
+ | |||
+ | o We have completely re-designed the menus of the desktop version | ||
+ | of Pathway/Genome Navigator to be more consistent with other | ||
+ | graphical interfaces, more intuitive to the user, and to provide | ||
+ | more screen area for the display of visualizations. | ||
+ | |||
+ | o We have integrated an SBML (Systems Biology Markup Language) output | ||
+ | tool written in the Church lab at Harvard into Pathway Tools, allowing | ||
+ | the reaction network within a Pathway/Genome Database to be exported | ||
+ | to SBML format, from which it can be imported into a number of | ||
+ | simulation and analysis software packages. | ||
+ | |||
+ | o We have reworked the display of information about protein complexes | ||
+ | within Pathway Tools to increase the clarity of this information. | ||
+ | |||
+ | o The preceding capabilities will be present in the February release | ||
+ | of Pathway Tools. | ||
+ | |||
+ | o We have received many emails from users reporting bugs, and asking for | ||
+ | information. | ||
+ | |||
+ | o 80 groups have licensed Pathway Tools to date. | ||
+ | |||
+ | o Pathway/Genome Databases available through the web include: | ||
+ | |||
+ | o Saccharomyces cerevisiae, Stanford University | ||
+ | http://pathway.yeastgenome.org/biocyc/ | ||
+ | |||
+ | o Plasmodium falciparum, Stanford University | ||
+ | plasmocyc.stanford.edu | ||
+ | |||
+ | o Mycobacterium tuberculosis, Stanford University | ||
+ | BioCyc.org | ||
+ | |||
+ | o Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington | ||
+ | Arabidopsis.org:1555 | ||
+ | |||
+ | o Methanococcus jannaschii, EBI | ||
+ | Maine.ebi.ac.uk:1555 (availability intermittent) | ||
+ | |||
+ | |||
+ | Pathway Tools Status Report | ||
+ | Peter Karp | ||
+ | April 20, 2004 | ||
+ | |||
+ | Please note that the full history of updates to Pathway Tools can be | ||
+ | found at URL | ||
+ | http://bioinformatics.ai.sri.com/ptools/release-notes.html | ||
+ | |||
+ | Significant updates funded under this grant since the last report in | ||
+ | February 2004 are as follows. | ||
+ | |||
+ | o Version 8.0 of Pathway Tools was released on March 12, 2004. | ||
+ | SRI continues to hold to our planned schedule of two releases of | ||
+ | Pathway Tools per year. | ||
+ | |||
+ | o 275 groups have licensed Pathway Tools to date. The large jump | ||
+ | in this number since the last report reflects the fact that these | ||
+ | numbers also include groups who use Pathway Tools to query | ||
+ | existing Pathway/Genome Databases (not reported earlier), in addition | ||
+ | to groups who use it to create new databases. | ||
+ | |||
+ | o We have made very significant progress on development of an | ||
+ | algorithm to automatically lay out the one-page metabolic overview | ||
+ | diagram that shows the full metabolic network of an organism -- the | ||
+ | algorithm is now working. We are also in the process of adding new | ||
+ | components of the cellular machinery to this diagram. | ||
+ | |||
+ | o SRI has hosted two 4-day training sessions for Pathway Tools. | ||
+ | The dates and 26 attendees are listed below. Most attendees have | ||
+ | brought genomes with them to the training sessions, and have left | ||
+ | with draft Pathway/Genome Databases. | ||
+ | |||
+ | Tutorial on March 15-18, 2004 | ||
+ | |||
+ | 1. John Burke Biotique Inc. | ||
+ | 2. Guillaume Meurice Pasteur Institute | ||
+ | 3. David Simon Pasteur Institute | ||
+ | 4. Gregory P. Fournier MIT | ||
+ | 5. Alex Picone Biatech | ||
+ | 6. John Bashkin SRI | ||
+ | 7. Tit Yee Wong University of Memphis | ||
+ | 8. Ken Kaufman UC Berkeley | ||
+ | 9. Jeremy Glasner University of Wisconsin | ||
+ | 10. Lisa Herron-Olson University of Minnesota | ||
+ | 11. Devaki Bhaya Carnegie Institution | ||
+ | |||
+ | |||
+ | Tutorial on April 19-22, 2004 | ||
+ | |||
+ | 1 Dr. Matthew Berriman The Wellcome Trust Sanger Institute | ||
+ | T. brucei & L. major | ||
+ | 2 Herbert Chiang Washington University | ||
+ | Bacteroides thetaiotaomicron | ||
+ | 3 Clinton Fernandez University of British Columbia | ||
+ | Rhodococcus sp. RHA1 (~10MB) | ||
+ | 4 Lisa Koski University of Montreal, Canada | ||
+ | 5 Rebecca Krupp UCLA | ||
+ | Methanosarcina acetivorans | ||
+ | 6 Joanne Luciano BioPathways Consortium | ||
+ | Prochlorococcus marinus MED4 | ||
+ | 7 Jasintha Maniraja Universite Libre de Bruxelles | ||
+ | Mus musculus | ||
+ | 8 Linyong Mao Pacific Northwest National Laboratory | ||
+ | Shewanella oneidensis | ||
+ | 9 Michael P. McLeod University of British Columbia | ||
+ | Rhodococcus sp. RHA1 (~10MB) | ||
+ | 10 Dylan Morris CalTech | ||
+ | Mycoplasma genitalium | ||
+ | 11 Gavin Murphy CalTech | ||
+ | Bdellovibrio | ||
+ | 12 Joo-Heon Park University of Tex-Houston Med School | ||
+ | Treponema pallidum | ||
+ | 13 Liviu Popescu Cornell University, Computer Science | ||
+ | Saccharomyces cerevisiae | ||
+ | 14 Christopher Reigstad Washington University | ||
+ | unpublished uropathogenic E. coli | ||
+ | 15 Haluk Resat Pacific Northwest National Laboratory | ||
+ | 16 Jian Song Los Alamos National Laboratory | ||
+ | Pseudomonas aeruginosa | ||
+ | |||
+ | |||
+ | |||
+ | GMOD Project Status April 2004 D. Gilbert (gilbertd@indiana.edu) | ||
+ | |||
+ | Project members: Don Gilbert, Josh Goodman, Paul Poole, | ||
+ | Vasanth Singan (student), at Indiana University. | ||
+ | |||
+ | Projects in development for GMOD: | ||
+ | |||
+ | (1) LuceGene, document/object search/retrieval for genome data | ||
+ | www.gmod.org/lucegene/ eugenes.org:8081/gmod/lucegene/ | ||
+ | version 1.2 (alpha), released for public use April 2004. | ||
+ | In use at FlyBase.net, euGenes.org, wFleaBase. LuceGene is similar in | ||
+ | concept to the bioinformatic databank access tool SRS, and web search | ||
+ | systems such as Google. Based on Lucene, this Java program is fast and | ||
+ | flexible at search and retrieval of complex data objects. It | ||
+ | outperforms the Chado Postgres database by 10x or more at gene object | ||
+ | retrieval. | ||
+ | |||
+ | (2) Genome Directory System, data mining access to genome data | ||
+ | www.gmod.org/gds/ | ||
+ | In development, web services for SOAP access to genome data and bio | ||
+ | sequence databanks. Plan to provide production data mining services | ||
+ | through this including FlyBase, euGenes genomes and Bio-Mirror/IUBio | ||
+ | biosequence databanks. Will add to ARGOS package for genome databases. | ||
+ | Includes plan to test FlyBase data analyses over TeraGrid, Fall 2004. | ||
+ | |||
+ | (3) ARGOS, a replicable genome information system | ||
+ | www.gmod.org/argos/ flybase.net/argos/ eugenes.org/argos/ | ||
+ | Version 0.7 (alpha, March 2004). | ||
+ | ARGOS is used now for replicating public web-genome databases. Contains | ||
+ | all of FlyBase, euGenes, wFleaBase, and some other services. | ||
+ | Contents include 10 GB of multi-genome data (euGenes), 8 GB of Drosophila | ||
+ | data (FlyBase), and 500 MB of common software (servers, binaries). | ||
+ | |||
+ | Miscellany: | ||
+ | gmod/schema/XMLTools/ChadoSax/ reader for chado.xml provides | ||
+ | flybase annotation data access. | ||
+ | gmod/schema/GMODTools/ Perl modules using GMOD 0.001 release for | ||
+ | managing miscellaneous sequences (EST, GSS, etc) in Chado database | ||
+ | Used now in Daphnia / wFleaBase genome database (eugenes.org/daphnia) | ||
+ | Apollo data search/retrieval system used at | ||
+ | flybase.net/apollo/ | ||
+ | a web CGI using Chado Postgres + LuceGene | ||
+ | for retrieval of GAME XML annotations by | ||
+ | lookup of gene name, genome location, other attributes. | ||
+ | Tested, aided development, and used GMOD release 0.001, Postgres Chado, | ||
+ | XORT, Chado::DBI, GBrowse, etc. tools for FlyBase and wFleaBase, where | ||
+ | they now form the basis of data management. | ||
+ | |||
+ | |||
+ | |||
+ | GMOD Update from the Saccharomyces Genome Database (SGD) | ||
+ | |||
+ | Before the last GMOD meeting at Berkeley, SGD released several GMOD | ||
+ | software packages (Blast Graphic Viewer, Restriction Graphic Viewer and | ||
+ | GO Graphic Viewer). Since then, we have been working on incorporating | ||
+ | existing GMOD products into new tools and resources at SGD. Here is a | ||
+ | list of projects that are currently under development or already in | ||
+ | production. | ||
+ | |||
+ | 1. New Fungal BLAST using BLAST Graphic Viewer. | ||
+ | SGD has created a new Fungal BLAST interface using the BLAST Graphic | ||
+ | Viewer. This new tool can be used to do BLASTN or TBLASTN searches using | ||
+ | any sequence of choice against any combination of fungal sequence datasets, | ||
+ | including genome sequences of fungal model organisms and pathogens, ESTs, | ||
+ | and other fungal sequence sets in GenBank. The fungal BLAST search at SGD | ||
+ | can be accessed from this URL. | ||
+ | |||
+ | http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast-fungal.pl | ||
+ | |||
+ | |||
+ | 2. GBrowse at SGD | ||
+ | GBrowse has been set up at SGD. SGD is still testing the software | ||
+ | before making a general announcement about the availability of the | ||
+ | software. This software is running on top of a MySQL database whose | ||
+ | tables are populated from a flat file in GFF3 format (refer to the third | ||
+ | topic for detail). GBrowse at SGD can be accessed from this URL. | ||
+ | |||
+ | http://www.yeastgenome.org/cgi-bin/SGD/gbrowse/gbrowse/yeast | ||
+ | |||
+ | 3. GFF3 file format | ||
+ | SGD has started to provide the sequence features of S. cerevisiae | ||
+ | genome in a flat file, which is fully compatible with GFF3 format. | ||
+ | This file is used as the data input to load the MySQL database for | ||
+ | GBrowse and the PostgreSQL database running Chado schema for SGD Lite | ||
+ | at Princeton. This file is updated every week on SGD's ftp site. This | ||
+ | file is available for download from this URL. | ||
+ | |||
+ | ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGDGFF3.gff | ||
+ | |||
+ | |||
+ | 4. SGD Lite and CHADO | ||
+ | The SGD colony at Princeton has been working on installing GMOD | ||
+ | release 0.002. Both versions of the Chado schema in these releases | ||
+ | (.001 and .002) have been successfully installed and loaded (via a | ||
+ | modified GFF3 file) on a desktop running Mac OS 10.3.2 using the | ||
+ | included installation scripts. We are currently working on installing | ||
+ | 0.002, including GBrowse, on an Apple X server running 10.3.2. We plan | ||
+ | to assemble installation notes/documentation and distribute them during | ||
+ | the meeting. | ||
+ | |||
+ | 5. Textpresso Beta testing | ||
+ | SGD has a wealth of literature information. We want to provide | ||
+ | expanded text searching to our users, since we have an abstract and/or | ||
+ | full text for most of our references. Textpresso is an information | ||
+ | retrieval system developed by Wormbase at Caltech. Eimear Kenny spent | ||
+ | two weeks at SGD to help set up a test version of Textpresso. The SGD | ||
+ | Textpresso can be accessed from this URL. | ||
+ | |||
+ | http://www.yeastgenome.org/textpresso/ | ||
+ | |||
+ | Currently, we are working on improving Textpresso's software | ||
+ | performance, as well as developing a yeast version of the Textpresso | ||
+ | ontology. We improved the performance of the markup script (text2xml.pl) | ||
+ | by 50%. We are also considering a few options to improve the indexing | ||
+ | mechanism. With regard to the ontology, we have modified the 'Gene' | ||
+ | and 'Localization in Time and Space' categories. We are also currently | ||
+ | working on a few other categories, such as Allele, Transgene and | ||
+ | Phenotype, in order to best reflect the biology in S. cerevisiae. | ||
+ | </pre> | ||
[[Category:Apollo]] | [[Category:Apollo]] |
Revision as of 17:43, 26 May 2008
Generic Model Organism Database Construction Set
Meeting 4
GMOD Meeting April, 2004
Presentations
- Cain_040526.ppt
- Crosby_040526.ppt
- Emmert_040526.ppt
- Gelbart_040528.ppt, Orthology
- Gilbert_040526.ppt
- Harris_040527.ppt
- Kasprzyk_040526.ppt
- Kenny_040526.ppt
- Kodira_040526.ppt
- Matthews_040526.ppt
- Sabo_040526.ppt
- Schlueter_040526.ppt
- Terry_040526.ppt
- Worley_040526.ppt
Agenda
April 26, Morning: Combined Developers and Curators section
Mount Vernon Room
9:00am | Introductions Scott Cain (CSHL) |
9:20 | Don Gilbert (FlyBase, Indiana University) |
10:30 | Break |
10:45 | Frank Smutniak (FlyBase, Harvard University) |
11:20 | Stan Letovsky (FlyBase, Harvard University) |
11:45 | Lunch (on your own; many good restaurants, check with a local) |
April 26, Afternoon: Developer section
Mount Vernon Room
1:30 | Arek Kasprzyk (EBI) |
2:00 | GMOD/Turnkey web demo Brian O'Conner (UCLA) |
2:30 | Break |
3:00 | Eimear Kenny (WormBase, CalTech) |
3:30 | Toshiaki Katayama (Human Genome Center, University of Tokyo, Japan) |
3:45 | Break |
4:00 | David Emmert (FlyBase, Harvard University) |
4:30 | Scott Cain |
April 26, Afternoon: Curator section
Terrace Room
1:30 Jennifer Wortman (The Institute for Genomic Research)
1:50 Shannon Schlueter (Arabidopsis thaliana Plant Genome Database, Iowa State University)
2:10 Aniko Sabo (Genome Sequencing Center, Washington University School of Medicine)
2:30 Break
2:50 Madeline Crosby (FlyBase, Harvard University)
3:10 Kim Worley (Human Genome Sequencing Center, Baylor College of Medicine)
3:30 Astrid Terry (Joint Genome Institute)
3:50 Break
4:10 Chinnappa Kodira (Broad Institute)
4:30 Michele Clamp (Broad Institute)
4:50 Break
5:00 Group discussion
6:00 Dinner (on your own-see above)
April 27, Developer section
Mount Vernon Room
9:00 | GMOD Alpha release, Part II |
The goal here is to try to get the gmod alpha installed on computers to test the installation and working issues with the release. It is almost certainly the case that there will also be time for "breakout" sessions for smaller groups to discuss a variety of topics. Suggestions will be accepted both before and during the meeting.
April 27, Curator section
Terrace Room
9:00 | Apollo Demo Sima Misra (FlyBase & Berkeley Drosophila Genome Center) |
9:40 | Nomi Harris (FlyBase & Berkeley Drosophila Genome Center) |
10:00 | Hands-on Apollo workshop for curators Breakout session for Apollo developers |
12:00 | Lunch |
1:30 | Q&A session with curators & Apollo developers |
2:30 | Hands-on Apollo workshop for curators Breakout session for Apollo developers |
4:30 | Q&A session with curators & Apollo developers Group discussion |
April 27, Dinner
6:00 | Reservation on us, food paid for by you |
April 28, Combined Developer and Curator section
Mount Vernon Room
9:00 | Updates from the previous day Scott Cain and Sima Misra |
10:00 | Break |
10:15 | Bill Gelbart (FlyBase, Harvard University) |
11:15 | Break |
11:30 | Closing remarks, planning for the next meeting Scott Cain |
Progress reports
GMOD Project Progress Reports
April, 2004
-----------------------------

The past four months have seen the first two releases of gmod, which will become the suite of model organism database software. The first release, version 0.001 (alpha), was released in January, 2004. The main goal of that release was to establish a release procedure. The release consisted of a database schema, referred to as chado, developed primarily by FlyBase developers at Harvard and BDGP. Additionally, there were a variety of tools for installing and loading data into the database, developed primarily by Allen Day at UCLA and Scott Cain at CSHL. Finally, there was a compatible version of the Generic Genome Browser with a chado database adaptor developed to allow browsing of genome features directly from the database.

The second release, also an alpha release, consisted of the same components and was released in March, 2004. In this release, the installation procedure improved considerably, and a prerequisite that had caused testers difficulties was removed. During the GMOD meeting in April, this release was installed by several attendees during a workshop. Several suggestions were made that will be implemented in the next release.

There are several items planned for addition or improvement in the next two releases. Tools to allow importing and exporting XML formatted data from chado will be included, which will allow the sequence annotation tool, Apollo, to be used with chado. Additionally, a template-based web front end for chado called turnkey will be included in an upcoming release. This software is still early in the development process, but when it was presented to developers at the GMOD meeting in April, there was considerable interest in getting it included in a gmod release as soon as possible. Longer term goals for gmod releases include pubsearch and pubfetch.
The process of porting these applications has begun and is expected to be complete by the end of the year. A tool for literature-based sequence annotation, called JavaSEAN, is expected to be included in gmod in a similar time frame. Additionally, there are plans from the Apollo developers to create a new version of Apollo that will be able to read and write directly to the database without using an XML intermediary, which will simplify the process of sequence annotation considerably.

Apollo Progress Report (11/2003 - 4/2004)

Major improvements in release 1.3.6 (11/3/03):

Apollo now runs under JDK1.4, which works better on most platforms.
Can rubberband a region on the axis and the selected sequence will pop up in a Sequence window.
Results that represent hits against sequences that are new to their respective database (as indicated in tiers file) are shown with a box around them, so that the curator can immediately see which results are new and need to be looked at.
Search (Find) now allows full regexps.
Instead of having the config files in $HOME/.apollo be slightly modified copies of the ones in APOLLO_ROOT/conf, you can now put ONLY the stuff you want changed into your personal cfg files. Apollo will first read the ones in APOLLO_ROOT/conf, and then read your personal cfgs and apply any modifications.
Synteny (see Synteny section at end)

Major improvements in release 1.4.0 (internal release) (2/9/2004):

New game.tiers file format (easier to read and change). If you have an old game.tiers, it will be autoconverted to the new format.
Better handling of non-gene annotation types. New glyphs for showing them in main Apollo display.
New annotations are automatically assigned the type (e.g. gene, tRNA, etc.) appropriate to the evidence that was used to create them. (Type can then be changed in the annotation info editor, if desired.)
Structured transaction records are now added to the XML when you save. They include the type of object that changed (e.g.
TRANSCRIPT; ANNOTATION; COMMENT), the operation (e.g. ADD, SPLIT, etc.), the relevant names and/or IDs before and after the transaction, and the user and time/date when the change was made.
Support for translational exceptions, including frame shifts and one base pair genomic sequencing errors.
UTRs are now shown in a different (configurable) color from the rest of the gene.
Restriction enzyme mapper:
- Cut sites show up in main window (near the axis)
- Can now map multiple restriction enzymes at once
- Table of restriction fragments; can be selected for viewing in Sequence window
Annotation info window:
- Now has integrated annotation tree
- Shows arbitrary properties for annotations and transcripts (including validation_flag)
- Shows translational exceptions and genomic sequencing errors
- Lets you edit annotation ID as well as name/symbol
Ability to tag results by selecting from a list of comments, which are specified (as ResultTags) in game.style. Tagged results are crosshatched in pink in the display.
Fixed updating of peptide sequences.

Improvements in releases 1.4.1 (3/12/04) and 1.4.2 (3/18/04):

Red/green markers at axis show where sequence/region ends.
To help you identify splice sites that are unconventional, colored triangles appear in the annotation glyph.
Can now load D. melanogaster data from r3.1 (gadfly) and r3.2 (chado) (both via cgi).

1.4.3 (4/19/04):

Let users get the sequence of the entire segment you're looking at, not just a rubberbanded section. [File -> Save sequence]

Synteny progress, 11/03-4/04:

- Synteny now works with GAME. You can load one species and then use the blast or syntenic block results to another species (for now it's pseudo) to load another species. The other species is loaded with the same range around that feature. Links between the two species are automatically derived from the blast link features that are present in both datasets (no explicit link file needs to be specified).
- Database chooser was added to select the different species databases.
- Able to switch back and forth from synteny data adapter to regular data adapters without restarting Apollo.
- Can save and edit (edit could use some rigorous testing)
- Can home in on link from link popup menu. Zooms and shows the strands of homed in link, strands not in link are hidden.
- Species now zoom and scroll together by default. Can unlock zoom with shift key, and unlock scroll with menu item.
- You can now configure links between 2 curation sets that contain links to each other. link_type, source and hit species are specified in the linked type in the tiers file. This works with GAME, and in theory could be made to work with other adapters that have linked data embedded in the species data.

Textpresso: A progress report
Eimear Kenny, Hans-Michael Mueller and Paul Sternberg

Updates made to Textpresso since September 2003:

Textpresso for Yeast Literature (Toward a generic MOD information retrieval/extraction search engine)

SGD developers and curators met with Eimear Kenny for two weeks at the beginning of March at Stanford to build a Textpresso search engine for Yeast. During that period the Textpresso software was installed on a Solaris system and three builds with a test corpus of ~400 full text journal articles were completed. In addition, the Textpresso Ontology for worm literature was modified to a functional preliminary ontology for yeast literature. Plans to expand the corpus to 10,000 yeast papers and make improvements to the yeast ontology are underway at Stanford.

Integration of Textpresso into Literature Curation Pipeline

We have integrated Textpresso into the WormBase curation pipeline to expedite the extraction of genetic interaction information from the literature. A prototype curation interface has been developed to enable a curator to extract data from sentences returned by a Textpresso query for genetic interaction.
We find that these Textpresso sentences are enriched 3-fold for gene-gene interactions compared to sentences that mention two or more gene names and 39-fold compared to random sentences from the literature.

Textpresso MOD interface

We have generated a WormBase-like interface for Textpresso to integrate the Textpresso information retrieval engine in the WormBase web site.

http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract

Textpresso Package

Hans-Michael Mueller is working on packaging Textpresso for release in the first half of this year.

Textpresso paper ... under review

A Textpresso publication is currently under revision.

PubFetch/PubTrack Progress Report (April 2004)

PubFetch

PubFetch is a tool for accessing literature from various online resources. The goal is to provide a common interface and common format to downstream applications to allow them to query different literature repositories in a single, unified fashion.

PubFetch has been implemented in two forms:

* Java servlet core + simple web interface to provide interactive access to PubFetch
  * Provides access to PubMed and Agricola databases
* BioMOBY wrapper around servlet core to provide webservice access to PubFetch

A variety of new features have been introduced:

* Duplicate filtering - running the same search on multiple data sources results in some duplication of articles; the duplicate filter detects these articles, returning a non-redundant set of data. Database Ids from both sources are maintained in the non-redundant set.
* The web interface version highlights keywords in the search results to aid in review of the returned articles.
* Connection to full text - a hyperlink to the full text is returned (if available from PubMed)
* Filtering of 'ahead of print' articles - Abstracts are appearing in PubMed and being assigned PubMed Ids prior to being published and are being reassigned PubMed Ids after publication.
PubFetch allows filtering of these ahead-of-print articles to retrieve only published articles.

The BioMOBY interface provides the following services:

* SearchPubmed - Search PubMed for given query and get PMIDs
* GetPubmed - Retrieve PubMed articles in MEDLINE display format for given PMIDs
* FetchFull - Get FullText for given PMID
* fetchAgID - Search Agricola for given query and get Agricola accession number
* fetchAgDoc - Get Agricola document in MEDLINE-like format for given Agricola accession number

Current work

The integration of PubFetch and PubSearch is in progress; our goal is to have PubSearch using the PubFetch core module for literature retrieval by summer of 2004. We will be adapting the Rat Genome Database literature pipeline to use the PubFetch BioMOBY services as its source for literature data download.

The current version of PubFetch is available from the GMOD CVS:
http://cvs.sourceforge.net/viewcvs.py/gmod/pubfetch/

Implementation of PubSearch at RGD

Following a curator review of existing PubSearch functionality, the RGD curators requested a variety of new features to enable a more 'article-centric' view of the PubSearch database. These have been implemented by the TAIR group, and plans are underway to install this latest version of PubSearch at RGD, populate it with RGD/Rat data and test it in the RGD curation process.

PubTrack

PubTrack is a monitoring tool that tracks objects as they move through a process or workflow. Existing workflow tools move data through a specified process, passing datasets to applications, retrieving results and passing them to the next step in the flow. PubTrack does not aim to direct or control workflow, and it does not track the dataset as a whole. Instead it works at a higher resolution, tracking the data objects within the dataset so that users can follow a particular object as it moves through a process.

Progress to date:

* Review of existing workflow tools and schemas has been completed.
* The initial PubTrack schema has been developed and implemented in PostgreSQL.
* Initialization scripts have been written to populate the PubTrack database with initial object and process data. Perl scripts are used to parse and load initialization data in a standard XML format; a DTD is available and is used to confirm the data formatting.
* An API is under development to allow third-party applications to communicate with PubTrack to initialize and update the tracking information for objects under observation. This is being developed and tested using data from a proteomics MS/MS analysis pipeline that is being built in my lab.
* A basic web user interface is in development to provide end-users with the ability to view objects and their progress through their designated processes.
* The concept of 'estimated time of completion' has been added to allow long-term planning and project tracking. For example, the entire process of curating an article might typically take 3 days, so the estimated time of completion would be 3 days after the start of curation. This estimate can be displayed on a Gantt chart and updated as individual steps in the process are completed, allowing an increasingly refined view of the completion date. This is being used in our proteomics tracking: component 1 generates tissue samples from animals in a process that takes up to 3 weeks to complete. Tracking the progress and updating the completion time estimate with PubTrack allows lab members in component 2 to plan ahead. They are able to see which samples will be ready and on what date, and this is updated as the process progresses.

Current Work

When the API is stabilized we will deploy PubTrack in the existing RGD literature curation pipeline and ultimately in combination with PubSearch at RGD.
This will create an entire system allowing tracking of literature across a heterogeneous system as it is downloaded from PubMed, loaded into PubSearch, screened, moved to RGD's Oracle db, curated and ultimately filed. A more comprehensive user interface will be developed based on the experiences from the proteomics pipeline and the RGD curation pipeline. The goal is to provide generic tracking views and a way to allow specific users to customize the displays, charts and reports if needed.

PubTrack documents including schema, loading scripts, etc. can be found on the GMOD CVS:
http://cvs.sourceforge.net/viewcvs.py/gmod/pubtrack/

PubSearch update

We've migrated our database schema over to one that should be more compatible with a Chado schema: all of our table names are now prefixed with 'pub_', and we've done some column renaming so that we use consistent names throughout the system.

Our production server has also been upgraded from MySQL3 to MySQL4, and we've rewritten some parts of PubSearch to take advantage of the transaction support that the new MySQL provides. We've also added referential integrity constraints to the foreign keys in our tables.

We've adopted another tool called JCoverage to help us identify areas of our code that are not being touched by our unit tests, and have started to tighten up our test cases so that our major classes are being exercised.

We've worked toward removing dependencies on external resources. Hit generation now works directly from the Java codebase, rather than from an external Python script.

We've continued work on a keyword term browser to replace the highly munged version of AmiGO that we are running locally.

GBrowse

Project Coordinator: Lincoln Stein
Major Developers: Scott Cain, Aaron Mackey, Toshiaki Katayama, Vsevolod Ilyushchenko, Marc Logghe, Sheldon McKay, Mark Wilkinson

DESCRIPTION:

GBrowse is a web-based browser for genome annotations.
It is intended to complement Apollo by providing a search, browse and drill-down display for sequence-based features without the need for prior software installation. GBrowse uses a database adaptor system to connect to a single primary data source, and a temporary flat-file system to layer an arbitrary number of third-party annotations on top of the primary data. A plugin system is used to add new functionality to GBrowse, such as more advanced searches and dynamically-computed features such as ab initio gene predictions. An internationalization layer allows GBrowse to display button labels, menus and help text in a variety of common world languages.

The following GBrowse database adaptors currently exist:

Bio::DB::GFF (oracle, postgresql & mysql)
    Well-tested and in production.
Bio::DB::Das::Chado (postgresql)
    Well-tested and in early production.
GenBank proxy
    Well-tested and in production. Does not handle full GenBank keyword searches properly.
Bio::DB::Das::BioSQL
    Adaptor for the BioSQL schema. In beta test.
Bio::Das
    Adaptor for DAS sources. Released, but probably best considered in beta test.

GBrowse has been downloaded from SourceForge 1,830 times, but this is a poor way to count the number of GBrowse users. A more conservative estimate of users comes from tallying bug reports, which ensures that the user has at least tried to install the software. However, it represents an undercount. In any case, we can confirm that at least 100 laboratories have installed GBrowse. As the list attached to the bottom of this report shows, GBrowse can be found in academic, governmental and commercial organizations in North America, South America, Europe, Asia, Africa and Australia.

RECENT PROGRESS:

Since the last status report, we have added the following features to GBrowse:

1) SVG output
Users can now click on a link labeled "Publication Quality Image" and download a Scalable Vector Graphics version of the current view.
SVG is an editable format that can be manipulated with popular graphics programs such as Adobe Illustrator, and can be reprinted by journals without the pixelation that plagues bitmapped images.

2) Security
Tracks can now be protected by username & password, restricted to certain hosts, or limited to hosts presenting certain classes of RSA (digital) certificates. A restricted track does not appear on the screen of unauthorized users, allowing system administrators to present a mix of proprietary and public data.

3) DAS support
GBrowse can now run on top of distributed annotation system sources. DAS is supported in three ways:

a) As an external annotation source
Users can layer remote DAS tracks on top of the current view. The remote DAS tracks will remain active from session to session. The GBrowse administrator can preconfigure a set of "recommended" DAS sources, which will then appear in a user-selectable menu.

b) As a primary database
GBrowse can now be configured to use a local or remote DAS database as its primary data source. This means that one can point GBrowse at the UCSC or ENSEMBL databases and immediately begin browsing them using the GBrowse user interface.

c) As a DAS source
GBrowse will act as a DAS server. At the administrator's discretion, all or selected tracks can be made exportable via DAS, allowing sequence features to be shared between GBrowse instances or between GBrowse and other DAS clients.

4) Feature filtering and highlighting
A new filtering and highlighting API allows plugins to hide features based on a set of user-supplied criteria or to highlight them in various colors.

5) New adaptors
In addition to the DAS adaptor, we have added an experimental BioSQL adaptor to GBrowse. BioSQL is a flexible database schema designed by the BioPerl & BioJava projects for the purposes of holding GenBank/EMBL records in a relational format.
6) Support for GFF3 loading & dumping
GBrowse can now load and dump sequence annotations in GFF3 format (http://song.sourceforge.net), a preliminary specification that improves on the current GFF sequence feature format. The advantage of this format is that it uses the Sequence Ontology, a controlled vocabulary of sequence feature types.

7) Integrated MOBY support
The BioMOBY system (www.biomoby.org) is a web services system that allows users to quickly locate and invoke bioinformatics services. GBrowse now has an interface which allows it to find services that will operate on selected sequence features. For example, GBrowse can present users with a list of current services that will operate on Drosophila gene names.

8) Support for writeback
A writeback layer has been added to GBrowse to allow external editors to update the underlying database. This has been tested successfully with the Artemis editor in the context of a USDA pathogens database project. Testing with Apollo is still underway. Currently it is recommended to edit sequence databases via the shared Chado schema and the Apollo->Chado->GBrowse route, rather than to use Apollo->GBrowse directly.

9) New glyphs
We have recently added a number of new glyphs for use with the International HapMap Project. New glyphs include a "weighted allele" glyph that indicates the major and minor alleles of a single nucleotide polymorphism, and a set of glyphs for visualizing haplotype blocks.

10) Bug fixes
Performance has been improved when uploading large third-party annotation files. Nucleotide-level alignments have been fixed when the display is "flipped." The feature name search methods have been cleaned up to provide more consistent behavior.

PLANS FOR THE FUTURE:

Performance is a concern when viewing large numbers of uploaded third-party features. We plan to fix this by implementing an indexed flat file cache for uploaded features.

The user interface needs to be improved in some respects.
One useful idea is to place an icon to the left of each track to indicate whether it is in an expanded or collapsed state.

The ability to use a different DAS source for each track, which is a feature of ISB GBrowse, will be ported over.

As always, we are looking for volunteers fluent in non-English languages to create and update the internationalization files.

Contact: Lincoln Stein <lstein@cshl.org>

APPENDIX. Confirmed users of GBrowse:

Agricultural Biotechnology Center, Hungary
BAWI, S. Korea
Baylor College of Medicine
Biocrates GmbH, Innsbruck
Brandeis University
Bristol-Myers Squibb
British Columbia Centre for Disease Control
CIRAD, France
CSIRO, Australia
Cambridge University (multiple labs)
Center for Genomics & Bioinformatics, Stockholm
Centre de Genetique Moleculaire, CNRS
Cold Spring Harbor Laboratory (multiple labs)
Compugen
Concordia University, Canada
Cornell Medical School
Cornell University
DNA Landmarks, Inc.
Donald Danforth Plant Sciences Center
Duke University (multiple labs)
EMBL, Heidelberg
EuGenes (hacked copy)
Faculdade de Medicina de Ribeirão Preto, São Paulo
FlyBase
Foundation for Research and Technology, Crete
Fundação Hemocentro, São Paulo
Genoscope, France
GrainGenes
Harvard University
Hospital for Sick Kids, Toronto
Illinois Institute of Technology
Incyte Corporation
Inpharmatica, Ltd.
Institute for Systems Biology, Seattle
Institute of Molecular and Cell Biology, Singapore
International Rice Research Institute, Philippines
John Innes Centre
KEGG
Kansas State University
Karolinska Institute
Kennedy Krieger Institute
Lawrence Berkeley Laboratories
Marine Biological Laboratories, Woods Hole
Massachusetts Institute of Technology (multiple labs)
Mayo Institute
McGill University
Meat Animal Research Center, University of Nebraska
Medical University of South Carolina
Michigan State University
NHGRI, NIH
National Cancer Institute, Frederick Cancer Center
New York University (multiple labs)
North Carolina State University
Northern Illinois University
Northwestern University
Oklahoma State University
Open Informatics Consulting Corp.
Oxagen Corp.
Pasteur Institute, Paris
Pioneer Corporation
QIAGEN Operon Corp.
RIKEN (multiple labs)
RatDB
Regulome, Inc.
Rhobio (Bayer CropScience SA & Biogemma joint venture)
Rigshospitalet, Copenhagen
Rockefeller University
Roslin Institute, Edinburgh
Russian Academy Medical Sciences
Serono International Corp, Geneva
Simon Fraser University
South Africa National Bioinformatics Institute
Southern Illinois University
St. Jude Children's Research Hospital, Memphis
Stowers Institute for Medical Research
Texas A&M (multiple labs)
The Institute for Genomic Research
Tulane University
University of California, Davis
University of Arizona (multiple labs)
University of British Columbia
University of California Santa Barbara
University of Georgia (multiple labs)
University of Minnesota
University of Muenster
University of New South Wales, Australia
University of Oklahoma (multiple labs)
University of Pennsylvania (multiple labs)
University of Southern California
University of Texas
University of Toronto
University of Virginia
University of Washington
Universität Giessen
Université de Liège, Belgium
Wageningen Universiteit & Researchcentrum, Netherlands
Washington University at St. Louis (multiple labs)
WormBase
deVGen, Belgium

CMAP

Main developer: Ken Clark

Recent improvements include:

* Now CGI-based (no more mod_perl dependencies), making installation much easier (and much more like GBrowse)
* Added SVG output
* Added multiple aliases for features
* Added support for arbitrary attributes for db objects
* New cross-reference scheme allows for unlimited xrefs on most db objects
* Experimental XML export/import of data added
* User tutorial added
* Faster, fewer bugs, etc.

CMAP is known to be in use by:

Barry Marler (Andy Paterson), Alex Feltus, Pratt: UGA
Rex Nelson, Chet Langin, Xiaokang Pan: Iowa State
Michelle Bobo: Oregon Health & Science University
Victor Ulat, Richard Bruskiewich: IRRI
Matthew Hobbs: University of Sydney (Australia)

Pathway Tools Status Report
Peter Karp
February 5, 2004

Please note that the full history of updates to Pathway Tools can be found at URL http://bioinformatics.ai.sri.com/ptools/release-notes.html

Significant updates funded under this grant since the last report are as follows.

o We have implemented the proposed Napster-like peer-to-peer sharing of Pathway/Genome Databases via a central network registry server. Pathway Tools users will be able to use the software to register new PGDBs that they create with this central registry server at SRI, and they will be able to use the software to browse the registry and to retrieve and install PGDBs listed there for local analysis.

o Pathway Tools has been extended to support annotation of protein domains, sites, and chemical modifications. We have created an ontology of domain, site, and modification types. The Pathway/Genome Editor tools have been extended to allow users to interactively annotate these features on protein sequences, and the Pathway/Genome Navigator has been extended to display these annotated features.
o We have added a batch-processing mode to the portion of Pathway Tools that creates new Pathway/Genome Databases to allow large-scale automated processing of multiple genomes without manual intervention. We have undertaken a collaboration with the European Bioinformatics Institute, who are interested in applying Pathway Tools to generate Pathway/Genome Databases for a large number of genomes.

o We have integrated an algorithm for pathway hole filling into Pathway Tools. A pathway hole is a reaction step in a metabolic pathway for which no enzyme has been identified in the genome of an organism. The pathway hole filler uses a combination of techniques to predict which genes in the genome code for these missing enzymes. [This algorithm was developed under separate funding.]

o We have completely re-designed the menus of the desktop version of Pathway/Genome Navigator to be more consistent with other graphical interfaces, more intuitive to the user, and to provide more screen area for the display of visualizations.

o We have integrated an SBML (Systems Biology Markup Language) output tool written in the Church lab at Harvard into Pathway Tools, allowing the reaction network within a Pathway/Genome Database to be exported to SBML format, from which it can be imported into a number of simulation and analysis software packages.

o We have reworked the display of information about protein complexes within Pathway Tools to increase the clarity of this information.

o The preceding capabilities will be present in the February release of Pathway Tools.

o We have received many emails from users reporting bugs and asking for information.

o 80 groups have licensed Pathway Tools to date.
o Pathway/Genome Databases available through the web include:
    o Saccharomyces cerevisiae, Stanford University: http://pathway.yeastgenome.org/biocyc/
    o Plasmodium falciparum, Stanford University: plasmocyc.stanford.edu
    o Mycobacterium tuberculosis, Stanford University: BioCyc.org
    o Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington: Arabidopsis.org:1555
    o Methanococcus jannaschii, EBI: Maine.ebi.ac.uk:1555 (availability intermittent)

Pathway Tools Status Report
Peter Karp
April 20, 2004

Please note that the full history of updates to Pathway Tools can be found at URL http://bioinformatics.ai.sri.com/ptools/release-notes.html

Significant updates funded under this grant since the last report in February 2004 are as follows.

o Version 8.0 of Pathway Tools was released on March 12, 2004. SRI continues to hold to our planned schedule of two releases of Pathway Tools per year.

o 275 groups have licensed Pathway Tools to date. The large jump in this number since the last report reflects the fact that these numbers also include groups who use Pathway Tools to query existing Pathway/Genome Databases (not reported earlier), in addition to groups who use it to create new databases.

o We have made very significant progress on development of an algorithm to automatically lay out the one-page metabolic overview diagram that shows the full metabolic network of an organism; the algorithm is now working. We are also in the process of adding new components of the cellular machinery to this diagram.

o SRI has hosted two 4-day training sessions for Pathway Tools. The dates and 26 attendees are listed below. Most attendees have brought genomes with them to the training sessions, and have left with draft Pathway/Genome Databases.

Tutorial on March 15-18, 2004

1. John Burke, Biotique Inc.
2. Guillaume Meurice, Pasteur Institute
3. David Simon, Pasteur Institute
4. Gregory P. Fournier, MIT
5. Alex Picone, Biatech
6. John Bashkin, SRI
7. Tit Yee Wong, University of Memphis
8. Ken Kaufman, UC Berkeley
9. Jeremy Glasner, University of Wisconsin
10. Lisa Herron-Olson, University of Minnesota
11. Devaki Bhaya, Carnegie Institution

Tutorial on April 19-22, 2004

1. Dr. Matthew Berriman, The Wellcome Trust Sanger Institute: T. brucei & L. major
2. Herbert Chiang, Washington University: Bacteroides thetaiotaomicron
3. Clinton Fernandez, University of British Columbia: Rhodococcus sp. RHA1 (~10MB)
4. Lisa Koski, University of Montreal, Canada
5. Rebecca Krupp, UCLA: Methanosarcina acetivorans
6. Joanne Luciano, BioPathways Consortium: Prochlorococcus marinus MED4
7. Jasintha Maniraja, Universite Libre de Bruxelles: Mus musculus
8. Linyong Mao, Pacific Northwest National Laboratory: Shewanella oneidensis
9. Michael P. McLeod, University of British Columbia: Rhodococcus sp. RHA1 (~10MB)
10. Dylan Morris, CalTech: Mycoplasma genitalium
11. Gavin Murphy, CalTech: Bdellovibrio
12. Joo-Heon Park, University of Tex-Houston Med School: Treponema pallidum
13. Liviu Popescu, Cornell University, Computer Science: Saccharomyces cerevisiae
14. Christopher Reigstad, Washington University: unpublished uropathogenic E. coli
15. Haluk Resat, Pacific Northwest National Laboratory
16. Jian Song, Los Alamos National Laboratory: Pseudomonas aeruginosa

GMOD Project Status April 2004
D. Gilbert (gilbertd@indiana.edu)

Project members: Don Gilbert, Josh Goodman, Paul Poole, Vasanth Singan (student), at Indiana University.

Projects in development for GMOD:

(1) LuceGene, document/object search/retrieval for genome data
www.gmod.org/lucegene/
eugenes.org:8081/gmod/lucegene/
Version 1.2 (alpha), released for public use April 2004. In use at FlyBase.net, euGenes.org, wFleaBase. LuceGene is similar in concept to the bioinformatic databank access tool SRS, and web search systems such as Google. Based on Lucene, this Java program is fast and flexible at search and retrieval of complex data objects. It outperforms the Chado Postgres database by 10x or more at gene object retrieval.
(2) Genome Directory System, data mining access to genome data
www.gmod.org/gds/
In development: web services for SOAP access to genome data and bio sequence databanks. We plan to provide production data mining services through this, including FlyBase, euGenes genomes and Bio-Mirror/IUBio biosequence databanks. Will add to ARGOS package for genome databases. Includes plan to test FlyBase data analyses over TeraGrid, Fall 2004.

(3) ARGOS, a replicable genome information system
www.gmod.org/argos/
flybase.net/argos/
eugenes.org/argos/
Version 0.7 (alpha, March 2004). ARGOS is used now for replicating public web-genome databases. Contains all of FlyBase, euGenes, wFleaBase, and some other services. Contents include 10 GB of multi-genome data (euGenes), 8 GB of Drosophila data (FlyBase), and 500 MB of common software (servers, binaries).

Miscellany:

gmod/schema/XMLTools/ChadoSax/
    Reader for chado.xml; provides FlyBase annotation data access.

gmod/schema/GMODTools/
    Perl modules using GMOD 0.001 release for managing miscellaneous sequences (EST, GSS, etc.) in a Chado database. Used now in the Daphnia / wFleaBase genome database (eugenes.org/daphnia).

Apollo data search/retrieval system used at flybase.net/apollo/
    A web CGI using Chado Postgres + LuceGene for retrieval of Game XML annotations by lookup of gene name, genome location, or other attributes.

Tested, aided development of, and used GMOD release 0.001, Postgres Chado, XORT, Chado::DBI, GBrowse, etc. tools for FlyBase and wFleaBase, where they now form the basis of data management.

GMOD Update from the Saccharomyces Genome Database (SGD)

Before the last GMOD meeting at Berkeley, SGD released several GMOD software packages (Blast Graphic Viewer, Restriction Graphic Viewer and GO Graphic Viewer). Since then, we have been working on incorporating existing GMOD products into new tools and resources at SGD. Here is a list of projects that are currently under development or already in production.

1. New Fungal BLAST using BLAST Graphic Viewer.
SGD has created a new Fungal BLAST interface using the BLAST Graphic Viewer. This new tool can be used to do BLASTN or TBLASTN searches using any sequence of choice against any combination of fungal sequence datasets, including genome sequences of fungal model organisms and pathogens, ESTs, and other fungal sequence sets in GenBank. The fungal BLAST search at SGD can be accessed from this URL.
http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast-fungal.pl

2. GBrowse at SGD
GBrowse has been set up at SGD. SGD is still testing the software before making a general announcement about the availability of the software. This software is running on top of a MySQL database whose tables are populated from a flat file in GFF3 format (refer to the third topic for detail). GBrowse at SGD can be accessed from this URL.
http://www.yeastgenome.org/cgi-bin/SGD/gbrowse/gbrowse/yeast

3. GFF3 file format
SGD has started to provide the sequence features of the S. cerevisiae genome in a flat file, which is fully compatible with GFF3 format. This file is used as the data input to load the MySQL database for GBrowse and the PostgreSQL database running Chado schema for SGD Lite at Princeton. This file is updated every week on SGD's ftp site. This file is available for download from this URL.
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGDGFF3.gff

4. SGD Lite and CHADO
The SGD colony at Princeton has been working on installing GMOD release 0.002. Both versions of the Chado schema in these releases (.001 and .002) have been successfully installed and loaded (via a modified GFF3 file) on a desktop running Mac OS 10.3.2 using the included installation scripts. We are currently working on installing 0.002, including GBrowse, on an Apple X server running 10.3.2. We plan to assemble installation notes/documentation and distribute them during the meeting.

5. Textpresso Beta testing
SGD has a wealth of literature information.
We want to provide expanded text searching to our users, since we have an abstract and/or full text for most of our references. Textpresso is an information retrieval system developed by WormBase at Caltech. Eimear Kenny spent two weeks at SGD to help set up a test version of Textpresso. The SGD Textpresso can be accessed from this URL.
http://www.yeastgenome.org/textpresso/

Currently, we are working on improving Textpresso's software performance, as well as developing a yeast version of the Textpresso ontology. We improved the performance of the markup script (text2xml.pl) by 50%. We are also considering a few options to improve the indexing mechanism. With regard to the ontology, we have modified the 'Gene' and 'Localization in Time and Space' categories. We are also currently working on a few other categories, such as Allele, Transgene and Phenotype, in order to best reflect the biology in S. cerevisiae.
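
As a rough illustration of the GFF3 flat file mentioned in item 3 (the file used to load both the GBrowse MySQL database and the Chado PostgreSQL database), the sketch below reads one feature line in Python. It assumes the nine tab-separated columns of GFF3 as the format is specified today; the specification was still preliminary at the time of this report, so details may have differed. The function name parse_gff3_line and the sample record are invented for illustration and are not taken from the SGD file or from any GMOD tool.

```python
from urllib.parse import unquote

# Column names follow the nine-column GFF3 layout (an assumption about
# the file's layout; the 2004 draft may have differed in detail).
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines
    (those starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs; values may be
    # comma-separated lists and are URL-escaped.
    attrs = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attrs
    return record


# Hypothetical record, invented for illustration only.
sample = "chrI\texample\tgene\t335\t649\t.\t+\t.\tID=geneA;Name=geneA;Note=example%20feature"
feature = parse_gff3_line(sample)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Note"])
```

A loader for either database would iterate over the file line by line, skip the None results, and map the type column onto Sequence Ontology terms; that mapping is outside the scope of this sketch.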