Difference between revisions of "August 2009 GMOD Meeting"

 
<center>
{| style="vertical-align: middle; border: 2px solid #A6A6BC; text-align: center" cellpadding="10"
| <span style="font-size: 200%; line-height: 120%"><b>August 2009 GMOD Meeting</b><br />6-7 August, 2009<br />Oxford UK</span><br><span style="font-size: 160%; line-height: 120%">Part of [[GMOD Europe 2009]], five days of GMOD including a [[2009 GMOD Summer School - Europe|GMOD Summer School]]</span>
<br>[[File:Aug2009MeetingPhoto.JPG|August 2009 GMOD Meeting]]
| [[File:GMOD2009Europe300.png|link=GMOD Europe 2009|GMOD Europe 2009]]
|}
</center>
__NOTITLE__
 
  
The next [[Meetings|GMOD Community Meeting]] will be held 6-7 August, 2009, in Oxford UK.  The meeting will be a part of '''''[[GMOD Europe 2009]]''''', a week long event that also includes a [[2009 GMOD Summer School - Europe|GMOD Summer School]].  This is the first time a GMOD meeting has been held in Europe.
 
  
If you want to know what happens at a GMOD meeting, see the writeup of the [[January 2009 GMOD Meeting|January 2009]] or [[July 2008 GMOD Meeting|July 2008]], or any [[Meetings|other previous meeting]].
+
This [[Meetings|GMOD Community Meeting]] was held 6-7 August, 2009, in Oxford UK.  The meeting was a part of '''''[[GMOD Europe 2009]]''''', a week-long event that also included a [[2009 GMOD Summer School - Europe|GMOD Summer School]].  This was the first time a GMOD meeting was held in Europe.
  
 +
As with previous [[Meetings|GMOD meetings]], this meeting had a mixture of project talks, component talks, and user talks.  The agenda was driven by [[#Agenda Suggestions|attendee suggestions]].  The two previous [[meetings]] were the [[January 2009 GMOD Meeting|January 2009]] and [[July 2008 GMOD Meeting|July 2008]] meetings.  GMOD meetings are an excellent way to meet GMOD developers and users, and to learn (and affect) what's coming in the project.
  
__TOC__
 
  
 +
= Schedule =
  
== Cost and Registration ==
+
<div class="emphasisbox">
 +
<div style="text-align: center; font-size: 150%; padding-bottom: 0.3em">Heng Li</div>
 +
<div style="text-align: center; font-size: 120%">[http://www.sanger.ac.uk Wellcome Trust Sanger Institute]</div>
 +
<br />
 +
[http://www.sanger.ac.uk/Users/lh3/ Dr Heng Li] of the Sanger Institute was the special guest speaker. Heng [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|discussed]] his recent work on [http://samtools.sourceforge.net SAMtools], a set of file formats and scripts for efficiently storing and accessing next generation sequence data.  Heng is a developer on several projects focused on next generation sequencing, including [http://samtools.sourceforge.net SAMtools], BWA, and [http://maq.sourceforge.net MAQ].</div>
  
The cost and registration dates have not yet been set for this meeting. Once they are set, we'll announce them here, on the [[GMOD News]] page/RSS feed, and to the [[GMOD Mailing Lists]].
+
{| class="wikitable" border="1" cellpadding="5" cellspacing="0"
 +
|-
 +
! width="12%" | Date
 +
! width="8%"  | Time
 +
! width="70%" | Session
 +
! width="10%" | Link(s)
 +
|-
 +
| colspan="4" |
 +
|-
 +
! rowspan="11" | Thursday<br />6 August
 +
! style="background-color: #fefefe; color: #aaaaaa" | 8:30-12:00
 +
| style="background-color: #fefefe; color: #aaaaaa" align="center" |Last half day of [[2009 GMOD Summer School - Europe|<span style="color: #aaaaff">2009 GMOD Summer School - Europe</span>]]
 +
|-
 +
| colspan="3" |
 +
|-
 +
! 13:30-14:30
 +
| align="center" | [[User:Scott|Scott Cain]] - Introductions and the State of GMOD
 +
| [http://prezi.com/143773/ Prezi], [[Media:Aug2009StateOfGMOD.ppt|PPT]], [[Media:Aug2009StateOfGMOD.pdf|PDF]], [[#GMOD Project Talks|Summary]]
 +
|-
 +
! 14:30-15:00
 +
| align="center" | [[User:Clements|Dave Clements]] - [[GMOD Help Desk]] Stuff
 +
| [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009GMODHelpDesk.ppt PPT], [[Media:Aug2009HelpDesk.pdf|PDF]], [[#GMOD Project Talks|Summary]]
 +
|-
 +
! 15:00-15:30
 +
| align="center" | [[User:JunZhao|Jun Zhao]] - Linked Data for GMOD Databases
 +
| [[Media:Aug2009LinkedData.pdf|PDF]], [[#Linked Data for GMOD Databases|Summary]]
 +
|-
 +
! 15:30-15:45
 +
| align="center" | Coffee Break
 +
|
 +
|-
 +
! 15:45-16:15
 +
| align="center" | Steve Taylor - GMOD in the Trenches
 +
| [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/GMODInTheTrenches.pdf PDF], [[#GMOD in the Trenches|Summary]]
 +
|-
 +
! 16:15-16:30
 +
| align="center" | [[User:Scott|Scott Cain]] (for [[User:RBuels|Robert Buels]]) - A [http://search.cpan.org/~ribasushi/DBIx-Class-0.08108/ DBIx::Class] layer for [[Chado]]
 +
| [http://gmod.org/dbic_chado_slides/start.html S5 Slides], [[#A DBIx Class layer for Chado|Summary]]
 +
|-
 +
! 16:30-17:00
 +
| align="center" | [[User:Elee|Ed Lee]] - [http://code.google.com/p/gbol GMOD Biological Object Layer]
 +
| [[Media:Aug2009Gobol.pdf|PDF]], [[#GMOD Biological Object Layer|Summary]]
 +
|-
 +
! 17:00-17:30
 +
| align="center" | [[User:Jogoodma|Josh Goodman]] - A Restful interface for MODs
 +
| [[#A Restful interface for MODs|Summary]]
 +
|-
 +
! 17:30
 +
| align="center" | Dinner (on your own)
 +
|-
 +
| colspan="4" |
 +
|-
 +
! rowspan="12" | Friday<br />7 August
 +
!  8:45-9:15
 +
| align="center" | [http://www.sanger.ac.uk/Users/lh3/ Heng Li] - Quest for Standard: Sequence alignment/map format (SAM) and [http://samtools.sourceforge.net SAMtools]
 +
| [[Media:Aug2009Sam.pdf|PDF]], [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|Summary]]
 +
|-
 +
!  9:15-9:45
 +
| align="center" | [[User:Clements|Dave Clements]] - Visualising NGS Data in GBrowse 2
 +
| [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009NGSinGBrowse.ppt PPT], [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009NGSinGBrowse.pdf PDF], [[#Visualising NGS Data in GBrowse 2|Summary]]
 +
|-
 +
!  9:45-10:15
 +
| align="center" | Erick Antezana & Frederic Potier - GBrowse: Lessons Learned and Statement of Interest
 +
| [[Media:Aug2009GBrowse2ImplPersp.pdf|PDF]], [[#GBrowse: Lessons Learned and Statement of Interest|Summary]]
 +
|-
 +
!  10:15-10:45
 +
| align="center" | Ian Holmes - [[JBrowse]]
 +
| [[Media:Aug2009JBrowse.pdf|PDF]], [[#JBrowse|Summary]]
 +
|-
 +
!  10:45-11:00
 +
| align="center" | Coffee Break
 +
|-
 +
!  11:00-11:30
 +
| align="center" | [[User:Mckays|Sheldon McKay]] - [[GBrowse_syn]]
 +
| [[Media:Aug2009GBrowse_syn.pdf|PDF]], [[#GBrowse_syn|Summary]]
 +
|-
 +
!  11:30-12:30
 +
| align="center" | Discussion: [[Next Generation Sequencing|NextGen]] data and GMOD: What do we do (and not do)?
 +
|
 +
|-
 +
!  12:30-13:30
 +
! Catered Lunch
 +
|-
 +
!  13:30-14:00
 +
| align="center" | Alessandra Bilardi - [http://gbrowse.org GBrowse.org]
 +
| [[Media:Aug2009GBrowseOrg.pdf|PDF]], [[#GBrowse.org|Summary]]
 +
|-
 +
!  14:00-14:30
 +
| align="center" | Jonathan Warren - [[DAS]] update
 +
| [[Media:Aug2009DASUpdate.ppt|PPT]], [[Media:Aug2009DASUpdate.pdf|PDF]], [[#DAS update|Summary]]
 +
|-
 +
!  14:30-15:00
 +
| align="center" | Julie Sullivan - [[InterMine]] update
 +
| [[#InterMine update|Summary]]
 +
|-
 +
! 15:00-18:00
 +
| align="center" | Show and Tell, Discussion
 +
| [[#Show and Tell, Discussion|Summary]]
 +
|}
  
== Lodging ==
+
= Presentations =
 +
 
 +
== GMOD Project Talks ==
 +
[[File:Aug2009Scott.JPG|right|link=User:Scott|Scott]]
 +
''[[User:Scott|Scott Cain]], Ontario Institute for Cancer Research, [http://prezi.com/143773/ Intro], What's New [[Media:Aug2009StateOfGMOD.ppt|PPT]], [[Media:Aug2009StateOfGMOD.pdf|PDF]]''<br />
 +
''[[User:Clements|Dave Clements]], [http://nescent.org NESCent], Help Desk Update, [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009GMODHelpDesk.ppt PPT], [[Media:Aug2009HelpDesk.pdf|PDF]]''
 +
 
 +
=== HHMI Science Education Alliance ===
 +
 
 +
The [http://www.hhmi.org/grants/sea/ Howard Hughes Medical Institute's Science Education Alliance (SEA)] is using GMOD tools to teach annotation to college freshmen.  They isolate and sequence phage samples.  The sequence is then stored in [[Chado]], annotated with [[Apollo]] and visualized with [[GBrowse]].  The program is in production at 12 colleges across the US.
 +
 
 +
=== What's new ===
 +
 
 +
; [[Chado]] (GMOD) 1.1 is coming
 +
* Minor schema changes
 +
* Minor fixes to [[GFF]] scripts
 +
* Addition of Chris Mungall's script to create views based on [[Chado CV Module|CV]] terms.
 +
 
 +
; [[GBrowse]]
 +
*; [[:Category:GBrowse 2|GBrowse 2]]
 +
** Distributed databases and render servers
 +
** AJAX track loading
 +
** Improved configuration management
 +
*; [[Next Generation Sequencing]] in [[GBrowse]]
 +
** Support for SAM/BAM databases - see [http://samtools.sourceforge.net/ SAMtools]
 +
** Coverage XY-plot, Confidence density plot, Individual alignments, Paired reads
 +
** Currently Alpha in GBrowse 2; may work in GBrowse 1 with some [[DAS]] magic.
 +
*; Circular chromosome support
 +
** Can scroll through origin, and features can span origin
 +
** Coming in [[GBrowse]] 1.71
 +
** Developed by Nathan Liles of the [http://ecoliwiki.net EcoliWiki] project.
 +
 
 +
; [[JBrowse]]
 +
* Another complete rearchitecture
 +
* Uses AJAX for client side rendering
 +
 
 +
; [[GBrowse_syn]]
 +
* Distributed with GBrowse 1.70
 +
* Makes use of data adaptors/databases that [[GBrowse]] uses
 +
 
 +
; [[Tripal]]
 +
* [[Tripal]] is a set of modules for [http://drupal.org Drupal] to interact with a [[Chado]] database.
 +
* Drupal: Widely used CMS, very extensible
 +
* Can integrate [[GBrowse]]/[[CMap]]
 +
* Modules for [[Chado Organism Module|organism]], [[Chado Library Module|library]], [[Chado Sequence Module|sequence]]
 +
 
 +
; [[DIYA]]
 +
 
 +
[[DIYA]] is a gene prediction pipeline for prokaryotes.  It complements [[MAKER]], a pipeline for eukaryotes.  DIYA is actually a generic, lightweight pipeline framework which was initially built to produce gene predictions.  DIYA is becoming part of GMOD.
 +
 
 +
; Atlases and [http://aniseed-ibdm.univ-mrs.fr/ Aniseed]
 +
 
 +
[http://aniseed-ibdm.univ-mrs.fr/ Aniseed] is converting its schema to [[Chado]].  One of Aniseed's particular strengths is atlases for expression, anatomy, and cell fate.  They are extending Chado to better support atlases, and will also make their web front end available as a part of GMOD.
 +
 
 +
=== [[GMOD Summer School]] ===
 +
{| class="wikitable"
 +
! colspan="2" | 2008
 +
| colspan="2" |
 +
[[2008 GMOD Summer School]] - first school ever offered
 +
* 2 1/2 days at [http://nescent.org NESCent]
 +
* 5 [[GMOD Components]] covered; 4 instructors
 +
* ~30 applicants for 25 slots
 +
|-
 +
! colspan="2" | 2009
 +
|
 +
[[2009 GMOD Summer School - Americas]]
 +
* 4 days at [http://nescent.org NESCent]
 +
* 8 GMOD Components covered; 9 instructors
 +
* 52 applications for 25 slots
 +
|
 +
[[2009 GMOD Summer School - Europe]]
 +
* 3 1/2 days at University of Oxford
 +
* 7 GMOD Components covered; 10 instructors
 +
* 58 applications for 25 slots
 +
|}
 +
That's more than 3.5 times the 2008 level of interest (110 applications across the two 2009 schools versus ~30 in 2008).
 +
 
 +
We'll do another summer school at [http://nescent.org/ NESCent] in 2010.  We are also considering one in Asia/Pacific in 2010.
 +
 
 +
=== Outreach ===
 +
 
 +
* Since the [[January 2009 GMOD Meeting]] we've been busy - see [[Training and Outreach#Conference Workshops, Presentations, and Posters|Training and Outreach]].
 +
* And in the next few months
 +
** Half Day GMOD Workshop, preceding [http://www.ausbiotech2009.com.au/bia/bia-home Bioinformatics Australia], 28 October, ''probably''.
 +
** [[Comparative Genomics]] in GMOD, at [http://colloque.inra.fr/isyip Information Systems for Insect Pests], Rennes, France, 16-17 November
 +
 
 +
=== GMOD Community Surveys ===
 +
GMOD is now surveying the community every year.  The [[2008 GMOD Community Survey]] had 89 responses and is very informative about how GMOD is used.  The 2009 survey will be in October.
 +
 
 +
=== Upcoming GMOD Hackathon ? ===
 +
There may be a GMOD hackathon this coming spring (March to May) at [http://nescent.org US National Evolutionary Synthesis Center (NESCent)] in Durham, NC, USA.  If this happens the focus will be on extending GMOD for [[:Category:Evolution|evolutionary biology]].  Contact [[User:Clements|Dave]] if you want to be on the organizing committee or participate.
 +
 
 +
== [http://en.wikipedia.org/wiki/Linked_Data Linked Data] for GMOD Databases ==
 +
[[File:Aug2009JunPhoto.JPG|right|link=User:JunZhao|Jun Zhao]]
 +
 
 +
''[[User:JunZhao|Jun Zhao]], Department of Zoology, University of Oxford, [[Media:Aug2009LinkedData.pdf|PDF]]''
 +
 
 +
Jun first introduced the [http://www.w3.org/RDF/ Resource Description Framework (RDF)] and the [http://www.w3.org/TR/rdf-sparql-query/ SPARQL language] for querying it.
 +
 
 +
=== OpenFlyData ===
 +
 
 +
Jun discussed her group's efforts to build an RDF [http://en.wikipedia.org/wiki/Triplestore triple store] from several very different data sources: [[:Category:FlyBase|FlyBase]] (a [[Chado]] database), [http://www.fruitfly.org/ BDGP], [http://www.fly-ted.org/ FlyTED], [http://www.flyatlas.org/ FlyAtlas], and Affymetrix data sources.  The integrated triple-store can be accessed at [http://openflydata.org OpenFlyData].
 +
 
 +
They used
 +
* [http://www4.wiwiss.fu-berlin.de/bizer/d2rq/ D2RQ] mapping to load FlyBase and BDGP, using conservative mapping with minimum interpretation.
 +
* [http://zoo-garos.zoo.ox.ac.uk/ibrg/index.php/FlyTED#Tools_and_software OAI2SPARQL] to harvest N3 RDF metadata via the OAI-PMH protocol, using built-in support by Eprints, and further info from ESWC2008 paper.
 +
* [http://zoo-garos.zoo.ox.ac.uk/ibrg/index.php/FlyTED#Tools_and_software Custom Python program] to get FlyAtlas data.
 +
 
 +
Some performance numbers:
 +
* Loading: Our datasets ~175 million triples
 +
** [http://jena.sourceforge.net/ Jena] / [http://jena.hpl.hp.com/wiki/TDB TDB] gives much better load performance (~15-30K tps), on 64 bit system with [http://aws.amazon.com/ebs/ Amazon EBS storage] (~3hrs)
 +
* Querying:
 +
** Good enough for real time user interaction, e.g., <1s for single gene search, 1-4s for multigene search (unions)
 +
** No significant slowdown when scaling from 10m to 175m triples
 +
* Text matching and case insensitive search
 +
** Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
 +
** Pre-generated lower-case gene names and loaded into the FlyBase RDF DB
 +
** Tried with [http://www.openlinksw.com/virtuoso/ OpenLink Virtuoso], still ~10 seconds for a case-insensitive search
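
The workaround listed above, pre-generating lower-case names so that a case-insensitive search becomes an exact match, can be sketched in plain Python.  This is an illustration only: the gene identifiers, the names, and the dictionary standing in for the triple store are hypothetical, and none of this is OpenFlyData's actual code.

```python
import re

# Hypothetical gene records standing in for triples in the store.
GENES = {
    "FBgn0000001": "Adh",
    "FBgn0000002": "dpp",
    "FBgn0000003": "Notch",
}


def regex_search(name):
    """Case-insensitive match by scanning every record, as a regex filter does."""
    pattern = re.compile("^" + re.escape(name) + "$", re.IGNORECASE)
    return sorted(gid for gid, label in GENES.items() if pattern.match(label))


# Pre-generate the lower-case names once, at load time.
LOWER_INDEX = {}
for gid, label in GENES.items():
    LOWER_INDEX.setdefault(label.lower(), []).append(gid)


def indexed_search(name):
    """Case-insensitive match as an exact lookup on the lower-cased name."""
    return sorted(LOWER_INDEX.get(name.lower(), []))
```

Both functions return the same identifiers; the difference is that the regex version touches every record on every query, while the indexed version pays that cost once when the data is loaded.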
 +
 
 +
Jun used [http://openflydata.org OpenFlyData] to:
 +
* Search by gene, gene expression mashup: ([http://www.openflydata.org/flyui/build/apps/expressionmashup/ go])
 +
* Search gene expression by gene batch ([http://www.openflydata.org/flyui/build/apps/expressionbygenebatch/ go])
 +
* Search gene expression by tissue expression profile ([http://www.openflydata.org/flyui/build/apps/byexpressionprofile/ go])
 +
 
 +
=== Open-BioMed ===
 +
 
 +
Jun also described a second effort, [http://www.open-biomed.org.uk/ Open-BioMed], that uses the same technologies to [http://esw.w3.org/topic/HCLSIG/AlternativeMedicineUseCase/ connect knowledge about alternative medicine and western drugs].  Open-BioMed demonstrates the value of [http://linkeddata.org/ Linked Data], and shows a novel technique for creating interlinks between datasets on a large scale.  This is a joint effort of the [http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup BioRDF] and [http://esw.w3.org/topic/HCLSIG/LODD LODD (Linked Open Drug Data)] task forces of the [http://esw.w3.org/topic/HCLSIG World Wide Web Consortium (W3C) Health Care Life Science Interest Group].  Jun used Open-BioMed to search for herbs associated with a particular disease.
 +
 
 +
=== RDF & SPARQL: Benefits & Risks ===
 +
 
 +
Some identified benefits:
 +
* RDF provides a uniform and flexible data model
 +
** RDF dump is cheaper and quicker
 +
** Maintaining a separate SPARQL endpoint for each data source makes it easier than a data warehouse approach for handling data updates
 +
* RDF facilitates data re-use and re-purposing
 +
* SPARQL raises the point of departure for an application
 +
** Expressive, open-ended query protocol
 +
** Support for unanticipated queries
 +
 
 +
and risks:
 +
* Mapping data to RDF requires expertise and experience
 +
* Expressive query protocol is a double-edged sword
 +
* Performance is good for some queries, not for others ...
 +
 
 +
== GMOD in the Trenches ==
 +
[[File:Aug2009Stephen.JPG|right||Stephen Taylor]]
 +
 
 +
''Stephen Taylor, [http://www.molbiol.ox.ac.uk/CBRG_home.shtml Computational Biology Research Group], University of Oxford, [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/GMODInTheTrenches.pdf PDF]''
 +
 
 +
The [http://www.molbiol.ox.ac.uk/CBRG_home.shtml Computational Biology Research Group (CBRG)] provides bioinformatics support to researchers at the University of Oxford.  They are heavy GMOD users and have used [[GBrowse]], [[Citrina]], [[BioMart]], and [[Apollo]] (along with [[Artemis]]).
 +
 
 +
=== GBrowse at CBRG ===
 +
 
 +
Back in 2004, the CBRG wanted to pull data together to make a lab resource, and the genome is a useful data organiser.  The CBRG evaluated these platforms: UCSC, Ensembl, AceDB, and GBrowse.  Each had advantages and disadvantages, but GBrowse looked like it was built to be distributed and used elsewhere.  Ease of installation was not a priority for the others.
 +
 
 +
The CBRG now supports over 50 different GBrowse databases.  Data is mainly [http://gbrowse.molbiol.ox.ac.uk/cgi-bin/gbrowse/HUMAN_HG18 human], [http://gbrowse.molbiol.ox.ac.uk/cgi-bin/gbrowse/MOUSE_M37 mouse] or bacterial, and data types include time series, arrays, and ChIP-on-Chip.  They visualize a lot of [[Next Generation Sequencing]] data, including histone modifications, ChIP-Seq, cis/trans interaction data, PCR amplified regions, and RNA-Seq.
 +
 
 +
The CBRG actively manages data flow to its GBrowse instances.  Each production GBrowse instance has a matching development instance where updates and changes are staged and tested before pushing them to production.  They also use ''core'' and ''satellite'' databases.  Core databases are built for human and mouse using public source data.  To meet individual groups' needs they then clone a core database, load custom data that is specific to that group, and then run a script to merge the core and satellite GBrowse configuration files.  They use Apache to restrict access to the satellite instances.
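
The core/satellite merge step can be pictured with a short Python sketch.  The CBRG's own merge script was not shown, so everything here is an assumption: the section names, the track settings, and the use of <code>configparser</code>, which only handles simple INI-style stanzas and would not cope with everything a real GBrowse configuration file can contain.

```python
import configparser

# Hypothetical core configuration, built from public data.
CORE_CONF = """
[GENERAL]
description = Human core

[Genes]
feature = gene
glyph = gene
"""

# Hypothetical satellite configuration, holding one group's custom track.
SATELLITE_CONF = """
[GENERAL]
description = Human core plus lab ChIP-Seq

[LabChipSeq]
feature = peak
glyph = xyplot
"""


def merge_configs(core_text, satellite_text):
    """Load the core stanzas first, then let the satellite add or override."""
    merged = configparser.ConfigParser()
    merged.optionxform = str  # keep option names case-sensitive
    merged.read_string(core_text)
    merged.read_string(satellite_text)
    return merged


MERGED = merge_configs(CORE_CONF, SATELLITE_CONF)
```

Reading the satellite file second means its settings win wherever both files define the same option, while core tracks that the satellite does not mention are carried over unchanged.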
 +
 
 +
The CBRG strives to encourage power users.  Data is available for download, and they have regular meetings to discuss best practices.
 +
 
 +
=== Extending GBrowse ===
 +
 
 +
In the future they would like to use GBrowse as a workbench.  To do this they need flexible ways to import and export features.  For example, you can define a temporary track by uploading a GFF3 file, or by connecting via DAS to an outside source, or to another GBrowse.  It would be nice to have a method to ''commit'' a temporary track and make it permanent.  This requires some sort of user authentication.
 +
 
 +
Steve also walked through an example of how it would be useful to support querying and visualizing data from multiple loci at the same time.
 +
 
 +
=== Make Existing GBrowse More Useful to External Developers ===
 +
 
 +
Finally Steve listed these 5 ways to make GBrowse more useful to external developers:
 +
 
 +
* Document the general structure of the GBrowse Perl modules
 +
* Tips on debugging
 +
* Document / define API
 +
* Central Glyph page
 +
* Include a copy of BioPerl inside GBrowse
 +
 
 +
== A DBIx Class layer for Chado ==
 +
 
 +
[[File:Sgn small tag.png|right|link=User:RobertBuels|Sol Genomics Network]]
 +
 
 +
''[[User:Scott|Scott Cain]], OICR, for [[User:RobertBuels|Robert Buels]], [[:Category:SGN|Sol Genomics Network (SGN)]], [http://gmod.org/dbic_chado_slides/start.html S5 Slides]''
 +
 
 +
[[Chado]] needs [[:Category:Middleware|middleware]], a layer of software between the application (''e.g.'' a website) and the database.  Chado's flexible design makes for complex queries and a steep learning curve.  It is also hard to get good performance.  This talk introduces a Perl DBIx::Class layer for use with Chado, which can be used as the basis for many applications, including the next generation of [[Modware]].
 +
 
 +
{{CPAN|DBIx::Class}} is an object-relational mapping framework for Perl, and is the ''de facto'' standard.  It has powerful features for:
 +
* query building (the magic of chainable ResultSets)
 +
* cross-database deployment (using SQL::Translator in the backend)
 +
* testing with Fixtures
 +
 
 +
Middleware can help by storing and/or automating complex queries, codifying best practices with both code and unified, high-level documentation.  Some performance optimizations can be put in middleware and it can assist in creating indexes and materialized views.
 +
 
 +
 
 +
The [http://github.com/rbuels/dbic_chado/ Bio-Chado-Schema] project has been set up by [[User:RobertBuels|Robert Buels]], with source control at [http://github.com/rbuels/dbic_chado/ GitHub], and releases available on [http://search.cpan.org/perldoc?Bio::Chado::Schema CPAN].  It contains DBIx::Class modules for every Chado table, which should work with all database platforms that are supported by Chado.  The project uses automated tools to keep the modules in sync with changes in the Chado schema.  The project is actively looking for development help; CPAN releases are currently intended for developers.  Future goals include API support for common querying and loading patterns, interoperation with [[BioPerl]] objects, forming the basis for a future version of [[Modware]], and more.
 +
 
 +
Rob says:
 +
* other people should start building features onto and into it
 +
** and do some of the other things on the slides
 +
* make a new version of [[Modware]] based on it
 +
* do you think somebody could get funding to work on it full time?
 +
 
 +
== [http://code.google.com/p/gbol GMOD Biological Object Layer] ==
 +
[[File:Aug2009Ed.JPG|right|link=User:Elee|Ed Lee]]
 +
 
 +
''[[User:Elee|Ed Lee]], [http://www.berkeleybop.org/ BBOP], [[Media:Aug2009Gobol.pdf|PDF]]''
 +
 
 +
Ed has been working with E.O. Stinson and Robert Bruggner at BBOP, and [[User:RobinHouston|Robin Houston]] and Adrian Tivey at Sanger to create a Java based ''biological'' object layer (GBOL) for genomic features.
 +
 
 +
=== GBOL Architecture ===
 +
 
 +
GBOL is the top layer of a multilayer architecture:
 +
 
 +
; Biological Object Layer (GBOL)
 +
 
 +
This layer defines an object at a biological level of interest, say a ''gene''.  It aggregates together all of the information about that high level concept into a single, programmatically accessible entity.  It hides all of the information about how and where the underlying data is stored.
 +
 
 +
This layer is inspired by [[Chado]], but is not necessarily built on top of Chado.
 +
 
 +
; Biological Object/IO Layer
 +
 
 +
This layer ...
 +
 
 +
; Simple Object Layer
 +
 
 +
This layer knows about basic biological concepts, but does not directly know how or where this information is stored.
 +
 
 +
; Simple Object I/O Layer
 +
 
 +
This is the bottom layer of the stack and it is closely tied to how and where the data is stored.  This layer knows if it is talking to a [[Chado]] database, a [[GFF3]] file, or some other data source.
 +
 
 +
This layer can do simple aggregation such as "return all features in this range", but does not perform aggregation based on biological models.  That type of aggregation is performed by higher levels.
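
The split between range-based aggregation at the bottom and biological aggregation higher up can be illustrated with a small Python sketch.  GBOL itself is written in Java and its real classes were not shown in this summary, so every class, function, and feature name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleFeature:
    """A basic feature, with no knowledge of how it is stored."""
    name: str
    type: str
    start: int
    end: int
    parent: str = ""


class InMemoryIO:
    """Bottom layer: tied to the data source (here, just a Python list)."""

    def __init__(self, features):
        self._features = list(features)

    def features_in_range(self, start, end):
        # Simple, non-biological aggregation: everything overlapping a range.
        return [f for f in self._features if f.start <= end and f.end >= start]


@dataclass
class Gene:
    """Top layer: one entity gathering what is known about a gene."""
    name: str
    transcripts: list = field(default_factory=list)


def build_genes(io_layer, start, end):
    """Biological aggregation, done above the I/O layer."""
    feats = io_layer.features_in_range(start, end)
    genes = {f.name: Gene(f.name) for f in feats if f.type == "gene"}
    for f in feats:
        if f.type == "transcript" and f.parent in genes:
            genes[f.parent].transcripts.append(f.name)
    return genes


FEATURES = [
    SimpleFeature("g1", "gene", 100, 500),
    SimpleFeature("t1", "transcript", 100, 500, parent="g1"),
    SimpleFeature("g2", "gene", 1000, 1500),
]
GENES_FOUND = build_genes(InMemoryIO(FEATURES), 1, 600)
```

Swapping <code>InMemoryIO</code> for a class that reads from a database or a file would leave <code>build_genes</code> untouched, which is the point of keeping storage details in the bottom layer.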
 +
 
 +
=== Biological Layer Configuration ===
 +
 
 +
<syntaxhighlight lang="xml"><?xml version="1.0" encoding="UTF-8"?>
 +
<gbol_mappings>
 +
<feature_mappings>
 +
  <type cv="SO" term="gene" default="true">
 +
  <read_class>Gene</read_class>
 +
  </type>
 +
  <type cv="SO" term="transcript" default="true">
 +
  <read_class>Transcript</read_class>
 +
  </type>
 +
  <type cv="SO" term="my_transcript">
 +
  <read_class>Transcript</read_class>
 +
  </type>
 +
  …
 +
</feature_mappings>
 +
<relationship_mappings>
 +
  <type cv="relationship" term="part_of" default="true">
 +
  <read_class>PartOf</read_class>
 +
  </type>
 +
  …
 +
</relationship_mappings>
 +
</gbol_mappings> </syntaxhighlight>
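
To show how a mapping file of this shape might be consumed, here is a minimal Python sketch that reads the feature mappings into a lookup table.  It is not GBOL code (GBOL is Java); the function name is hypothetical, the elided entries from the example are simply left out, and the XML declaration is dropped to keep the parsing call simple.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example configuration, without the elided entries.
MAPPINGS_XML = """
<gbol_mappings>
<feature_mappings>
  <type cv="SO" term="gene" default="true">
  <read_class>Gene</read_class>
  </type>
  <type cv="SO" term="transcript" default="true">
  <read_class>Transcript</read_class>
  </type>
  <type cv="SO" term="my_transcript">
  <read_class>Transcript</read_class>
  </type>
</feature_mappings>
</gbol_mappings>
"""


def load_feature_mappings(xml_text):
    """Return {(cv, term): read_class} for every feature mapping."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for type_el in root.findall("./feature_mappings/type"):
        key = (type_el.get("cv"), type_el.get("term"))
        mapping[key] = type_el.findtext("read_class")
    return mapping


FEATURE_MAP = load_feature_mappings(MAPPINGS_XML)
```

With this table, a locally defined term such as <code>my_transcript</code> resolves to the same <code>Transcript</code> class as the standard term.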
 +
 
 +
=== Future Developments ===
 +
 
 +
* Continued development on Biological layer
 +
* Inference of data: infer introns from exon structure
 +
* New format handlers: [[Chado XML]], GAME XML, [[BioPerl]] bridge
 +
* Configuration of common relationship variations such as ESTs aligned to the genome directly vs having a "match" feature
 +
 
 +
 
 +
==  A Restful interface for MODs ==
 +
''[[User:Jogoodma|Josh Goodman]], [[:Category:FlyBase|FlyBase]]''
 +
 
 +
Josh talked about the progress of the [[GMOD REST API]] group that was started at the [[January 2009 GMOD Meeting]].
 +
 
 +
 
 +
== Quest for Standard: Sequence alignment/map format (SAM) and SAMtools ==
 +
[[File:Aug2009Heng.JPG|right||Heng Li]]
 +
 
 +
''[http://www.sanger.ac.uk/Users/lh3/ Heng Li], [http://www.sanger.ac.uk/ Wellcome Trust Sanger Institute], [[Media:Aug2009Sam.pdf|PDF]]''
 +
 
 +
Heng spoke about [http://samtools.sourceforge.net SAM/BAM and SAMtools], a platform agnostic set of file formats and programs for next generation sequence data.
 +
 
 +
SAM/BAM is a generic nucleotide alignment format that:
 +
* is simple to understand, easy to generate and easy to parse
 +
* is compact in file size
 +
* is streamable
 +
* supports fast random access
 +
 
 +
=== Quest for Standards ===
 +
 
 +
There had been no standardized and computationally efficient way to store the volumes of data that next generation sequencing produces.  Several formats such as phrap ACE and [[GFF]] existed but these were unable to scale up.
 +
 
 +
=== SAM Format ===
 +
 
 +
The [http://samtools.sourceforge.net/SAM1.pdf Sequence Alignment / Map (SAM) format] is motivated by short read alignment but also works with long reads and ''de novo'' assemblies.  SAM uses a GFF3-like tab-delimited format with 11 mandatory fields for key information, and variable optional fields and predefined tags for non-standard information.  It is designed to be simple to generate and to parse.  It uses an extended CIGAR string for various types of alignments.  The extended CIGAR string format adds support for ''clipped, spliced, multi-part'', and ''padded'' alignments.  See the [http://samtools.sourceforge.net/SAM1.pdf SAM Format Specification] for details.
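
As a rough illustration of how simple the format is to parse, the following Python sketch splits one alignment line into its 11 mandatory fields and decodes the CIGAR string.  The helper names are made up for this example, the field names follow the current SAM specification, and the sample line is a toy record rather than data from the talk.

```python
import re

# The 11 mandatory fields, named as in the current SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split a tab-delimited alignment line into mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields, left unparsed here
    return record


def parse_cigar(cigar):
    """Turn '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered (operations that consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of read bases described (operations that consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


EXAMPLE = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
RECORD = parse_sam_line(EXAMPLE)
```

For the example record the CIGAR string accounts for all 17 bases of the read, and the alignment covers 16 bases of the reference because of the one-base deletion.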
 +
 
 +
=== BAM Format ===
 +
 
 +
SAM is a text format.  The Binary Alignment/Map (BAM) format is an exact binary representation of SAM. It has Zlib/gzip compatible compression (and can be decompressed by zlib/gzip).  BAM is space efficient, achieving 1 byte per raw base pair, including sequence, quality, read name, position and meta info.  BAM is also ''streamable:'' programs can process alignments without loading the entire alignment into memory.  BAM is usually sorted by the leftmost chromosomal position.  BAM is indexed, supports random access, and can quickly retrieve sequences overlapping a specified region.
 +
 
 +
BAM uses BGZF, a generic indexable compression format.  The standard gzip/zlib format is not block-wise, so indexing it is intricate and inefficient.  BGZF instead splits the data into multiple standalone gzip/zlib blocks (64kB each).
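
The idea behind block-wise compression can be demonstrated with Python's standard <code>gzip</code> module.  This sketch is a simplification, not a BGZF implementation: real BGZF records each block's length inside the gzip header, whereas here the block offsets are kept in a separate list, and the function names are invented for the example.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as a series of standalone gzip members.

    Returns the concatenated bytes and the start offset of each member.
    """
    blob = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(blob))
        blob += gzip.compress(data[start:start + block_size])
    return bytes(blob), offsets


def read_block(blob, offsets, i):
    """Decompress only block i, without touching the blocks before it."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[offsets[i]:end])


DATA = bytes(range(256)) * 1024  # 256 KiB of sample data
BLOB, OFFSETS = compress_blocks(DATA)
```

The concatenated result is still a valid gzip stream, so ordinary gzip tools can decompress it from start to finish, while a reader that knows the offsets can jump straight to the block it needs.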
 +
 
 +
BAM indexing uses binning plus a linear index for alignments sorted by the leftmost coordinates.  B-trees and pure linear indexes are inefficient for resolving ‘overlap’ queries, while R-trees and pure binning indexes have difficulty with streaming.  For short read alignments, retrieving the reads in a region typically takes one seek function call, which is more efficient than R-trees.  The scheme also produces small index files (''e.g.'', ~9MB for deep human resequencing).
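
The binning half of the index can be sketched in a few lines of Python.  The function below is a port of the bin calculation as published in the SAM specification, written from memory of that routine rather than taken from the talk, so treat the constants as an assumption to check against the specification; the linear index is not shown.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains [beg, end).

    Coordinates are zero-based and half-open.  Bins are nested: the finest
    level covers 16 kbp per bin, and each coarser level covers 8 times more,
    up to a single bin 0 spanning 512 Mbp.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0
```

A short read almost always falls inside a single 16 kbp bin, so a region query only has to inspect a handful of bins; reads that straddle a bin boundary are pushed up to the next coarser level.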
 +
 
 +
=== APIs, Implementations and Supported Platforms ===
 +
 
 +
Several assembly programs can now produce SAM directly, and SAMtools comes with scripts to convert the output of several other assemblers to SAM format.
 +
 
 +
SAMtools also has native HTTP/FTP support.  Programs can retrieve alignments overlapping a specified region from a remote file via HTTP/FTP.  Simply replace the input BAM file name with a URL (HTTP/FTP only).  This partial load approach greatly reduces data transfer for applications such as genome browsers, which typically only need small regions of an assembly at any time.
 +
 
 +
Several implementations using [http://samtools.sourceforge.net SAMtools] are available.  The SAMtools package itself includes command line tools and C APIs for:
 +
* Conversion from other formats
 +
* SAM &hArr; BAM, indexing, sorting, merging, pileup, SNP/indel calling, alignment viewer ...
 +
* Native HTTP/FTP support
 +
 
 +
There are also implementations in Java ([http://picard.sourceforge.net Picard] and GATK), and Perl (Bio::DB::Sam, which is what [[GBrowse]] uses - see the next talk).
 +
 
 +
=== Displaying Alignments ===
 +
 
 +
An alignment viewer is a great help for method development:
 +
* Visually understand the alignment: the error rate, the depth, etc.
 +
* Validate aligner results: even read depth? right coordinates? right gaps?
 +
* Validate SNP/indel calls: human eyes are always better.
 +
* Validate structural variations: paired-end information
 +
 
 +
SAMtools comes with a [http://samtools.sourceforge.net/tview.shtml Text Alignment Viewer, tview] which uses the [http://www.gnu.org/software/ncurses/ GNU ncurses library].  tview retrieves alignments using FTP/HTTP and is fairly simple.  It shows alignments, but not annotation, paired-end information, multiple tracks, ...
 +
 
 +
The Broad Institute's Java-based [http://www.broadinstitute.org/igv/ Integrative Genomics Viewer (IGV)] also works with data in BAM format.
 +
 
 +
And you can view SAM/BAM in GBrowse using the Bio::DB::Sam Perl adaptor (based on [http://samtools.sourceforge.net/samtools/masterTOC.shtml SAMtools C APIs]). For SAM/BAM, GBrowse is a lightweight and versatile shared alignment viewer supporting multiple tracks and gene annotations.
 +
 
 +
For GBrowse, SAM/BAM can provide an efficient way to access large-scale new sequencing data, store various types of alignment (EST, mRNA, etc.) as an alternative to SQL databases, and possibly ''realize distributed alignment resources''.  GBrowse already pulls in data from remote sources using DAS.  It could be extended to pull in remote SAM/BAM data using FTP/HTTP.
 +
 
 +
Are distributed alignments feasible?  There is already Native HTTP/FTP support in SAMtools.  This could be added to Bio::DB::Sam as well.  Alignment files are compressed.  For short reads, one seek call (establishing network connection) is required to get alignments in a region.  This would require very little configuration at the server hosting alignments, and compressed data transfer between file servers and the GBrowse server.
 +
 
 +
There are some major obstacles.  The index files have to sit on local disks at the GBrowse server, and matching the reference sequences may be an issue.  Bandwidth and caching also have to be addressed.
 +
 
 +
 
 +
== Visualising NGS Data in GBrowse 2 ==
 +
''[[User:Clements|Dave Clements]], [http://nescent.org NESCent], [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009NGSinGBrowse.ppt PPT], [ftp://ftp.gmod.org/pub/gmod/Meetings/2009/August/Aug2009NGSinGBrowse.pdf PDF]''
 +
 
 +
[[User:Lstein|Lincoln Stein]] has written a [[GBrowse Adaptors|GBrowse adaptor]], Bio::DB::Sam, for [[Next Generation Sequencing]] data stored in the [http://samtools.sourceforge.net BAM format] that Heng Li [[#SAMtools for NextGen Sequence Data|described in his talk]].  This is currently in alpha release, and works only with [[:Category:GBrowse 2|GBrowse 2]].  It is available in the gbrowse-adaptors project of GMOD's [[CVS]] repository.  Short read, next generation sequence data can be directly represented in [[GFF3]], but the amount of data makes it very slow, and requires a very large database to back it.  Using Bio::DB::Sam on top of BAM files makes visualizing individual reads both computationally tractable and manageable.
 +
 
 +
The talk used an example of 4 ''E. coli'' strains: an ancestral strain for which a reference sequence is available, a manipulated strain, and then two strains with phage resistance that evolved from the manipulated strain.  Whole genome resequencing was performed on the manipulated and evolved lines.  The resequencing was done on an Illumina GA2 and then assembled with the [http://maq.sourceforge.org MAQ aligner].  The MAQ alignments were then converted to SAM using a SAMtools script, and then to BAM.
 +
 
 +
Dave then showed how to configure GBrowse to be a short read viewer using Bio::DB::Sam, including an example callback to show alignment quality using color.  However, the utility of showing short reads quickly declines as you zoom out past 100-200 bp.  You can also use Bio::DB::Sam to show summary statistics such as coverage depth.  Dave will work on documenting the Bio::DB::Sam adaptor and its interface to SAMtools in the coming months.
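To make the coverage depth summary concrete, here is a minimal sketch of how per-base depth can be computed from read alignments.  It is plain Python, not the Bio::DB::Sam API, and it assumes each read is reduced to a 0-based, half-open <code>(start, end)</code> interval on the reference; the function name and sample reads are made up for illustration.

```python
def coverage_depth(reads, region_start, region_end):
    """Return the number of reads covering each base of a region.

    reads: iterable of (start, end) intervals, 0-based and half-open.
    """
    length = region_end - region_start
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        start = max(start, region_start)
        end = min(end, region_end)
        if start >= end:
            continue  # read lies entirely outside the region
        diff[start - region_start] += 1
        diff[end - region_start] -= 1
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


reads = [(0, 5), (3, 8), (10, 12)]
print(coverage_depth(reads, 0, 10))  # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
```

A summary track like this stays readable when zoomed out, which is where drawing individual reads stops being useful.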
 +
 
 +
The talk then showed several other visualizations that can be done with next generation sequence data that don't display the short reads themselves.  This included a number of ways to show allele and genotype frequencies (including showing them on a geolocation map).
 +
 
 +
Finally, if you are planning on starting to use NGS data, make sure you have a lot of bioinformatics infrastructure in place first.
 +
 
 +
 
 +
== GBrowse: Lessons Learned and Statement of Interest ==
 +
[[File:Aug2009Erick.JPG|right|Erick]]
 +
''Erick Antezana & Frederic Potier, Bayer CropScience, [[Media:Aug2009GBrowse2ImplPersp.pdf|PDF]]''
 +
 
 +
=== History and Current GBrowse Infrastructure ===
 +
 
 +
Bayer CropScience uses [[GBrowse]] 1.70 and GBrowse 2, [[CMap]], [[Galaxy]], and [[Ergatis]].  They have been a GBrowse user since 2004.  They also evaluated [[Chado]] and chose not to use it because of performance issues.  They currently use GBrowse 2 and mainly Bio::DB::GFF databases, focused mainly on plants.  They have publicly available plant genomes, private genomes, and increasingly frequent annotation updates.  Their requirements include minor data reformatting, fast data loading and querying, a customizable application, and a high level of integrity.
 +
 
 +
Bayer currently has more than 30 databases with public data, at around 30GB.  Their in house data includes next generation sequence data (stored in BAM and accessed in GBrowse 2 via Bio::DB::Sam), genome annotation (stored in a Bio::DB::GFF database), and molecular mapping visualized with [[CMap]].  They are also considering supporting user annotation / manual curation with [[Apollo]] and/or [[Artemis]].  Their automated annotation workflow produces [[GFF]] and generates GBrowse configuration files.
 +
 
 +
Bayer has extended GBrowse in several ways, including user authentication, permissions, and tracking.
 +
 
 +
Other extensions include:
 +
* On the fly visualization
 +
* Blast anchoring/Sequence homology search
 +
** blast homologies are uploaded as user annotations
 +
* Plugins
 +
** data export
 +
** links to in house applications
 +
* In house keyword search engine
 +
** fast search utility
 +
** cross databases search
 +
* Gateway
 +
** centralised access point
 +
 
 +
=== Statement of Interest: Requirements and Needs ===
 +
 
 +
Bayer CropScience would also like to see GMOD extended in a number of areas.
 +
 
 +
==== GBrowse Database [[GBrowse Adaptors|Adaptors]] ====
 +
 
 +
* '''NGS adaptor''' (Bio::DB::Sam) is a key priority
 +
* '''Memory adaptor''' would like to be able to specify a file name or a complete path via a parameter, so that the adaptor doesn't need to load all the GFF files in the directory
 +
* '''[[Chado]] adaptor'''  Portability to Oracle; ability to store user-specific annotation / manual curation; a system to track versions and history of the annotations; and management of user access rights
 +
* '''SeqFeature::Store''' Portability to Oracle (c.f. user access rights via VPD) and faster loading time.
 +
* '''Compatibility with other genome browsers' databases''' for instance Ensembl databases?
 +
 
 +
==== GBrowse User Interaction ====
 +
 
 +
* Authentication
 +
** To track user sessions
 +
** To enable user access rights management
 +
* User Annotation Management
 +
** To store the user annotations in a database or in a file on the server, so that users can retrieve their annotations when connecting from different machines
 +
** To automatically send a user’s annotations to GBrowse via a URL parameter
 +
* Integration with [[CMap]]
 +
 
 +
==== GBrowse Configuration Files ====
 +
 
 +
The current format is error prone, difficult to debug, has a steep learning curve, and is time consuming to maintain.  Bayer (and [[#GMOD in the Trenches|CBRG]] and modENCODE and ...) partially work around this by having scripts generate their configuration files.
 +
 
 +
A ''better'' solution would be to have a better representation of the configuration file, [[Glossary#XML|XML]] for instance.  (''[[JBrowse]] addresses this issue by using [[Glossary#JSON|JSON]] for its configuration files - [[User:Clements|Dave]]'')
 +
 
 +
Would also like the ability to configure the global layout to enable or disable components, such as the custom tracks or display settings components.
 +
 
 +
Would also like to have a standardized way to specify metadata in the configuration files.  For example, species and assembly versions:
 +
 
 +
<syntaxhighlight lang="perl">#################################
# database definitions
#################################

[TAIR_Arabidopsis_V8:database]
db_adaptor         = Bio::DB::GFF
db_args            = -adaptor DBI::mysql
                     -dsn dbi:mysql:TAIR_Arabidopsis_V8
species            = Arabidopsis thaliana
assembly.source    = TAIR
assembly.version   = 8
annotation.source  = TAIR
annotation.version = 8</syntaxhighlight>
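If metadata were standardized this way, scripts could read it back out of the configuration.  The following is a minimal sketch in Python of a reader for stanzas shaped like the example above.  It is not GBrowse's own configuration parser, and the function names are hypothetical; it only assumes <code>[section]</code> headers, <code>key = value</code> lines, indented continuation lines, and <code>#</code> comments.

```python
def parse_stanzas(text):
    """Parse '[section]' stanzas of 'key = value' lines into nested dicts."""
    sections = {}
    current = None
    last_key = None
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.startswith("[") and stripped.endswith("]"):
            current = sections.setdefault(stripped[1:-1], {})
            last_key = None
        elif current is None:
            continue  # ignore anything before the first section
        elif raw[0].isspace() and last_key is not None:
            # An indented line continues the previous option's value.
            current[last_key] += " " + stripped
        elif "=" in stripped:
            key, value = stripped.split("=", 1)
            last_key = key.strip()
            current[last_key] = value.strip()
    return sections


def dotted_metadata(options):
    """Group dotted keys such as 'assembly.version' by their prefix."""
    grouped = {}
    for key, value in options.items():
        if "." in key:
            group, field = key.split(".", 1)
            grouped.setdefault(group, {})[field] = value
    return grouped


SAMPLE = """\
[TAIR_Arabidopsis_V8:database]
db_adaptor         = Bio::DB::GFF
db_args            = -adaptor DBI::mysql
                     -dsn dbi:mysql:TAIR_Arabidopsis_V8
species            = Arabidopsis thaliana
assembly.source    = TAIR
assembly.version   = 8
annotation.source  = TAIR
annotation.version = 8
"""

options = parse_stanzas(SAMPLE)["TAIR_Arabidopsis_V8:database"]
print(dotted_metadata(options)["assembly"])  # {'source': 'TAIR', 'version': '8'}
```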
 +
 
 +
==== Metadata Web Services ====
 +
 
 +
Web services could be used to query and report on metadata such as the list of reference sequences, annotation version, assembly version, and list of available feature types.
 +
 
 +
Suggestion:
 +
<syntaxhighlight lang="xml"><browser>
  <species>Arabidopsis</species>
  <assembly>bayer</assembly>
  <annotation>1.0</annotation>
  <reference-sequence>chr1</reference-sequence>
  <reference-sequence>chr2</reference-sequence>
  <feature-type>fgenesh:mRNA</feature-type>
  <feature-type>splign:mRNA</feature-type>
</browser></syntaxhighlight>
 +
This information could be defined in the config file:
 +
<syntaxhighlight lang="perl">[TAIR_Arabidopsis_V8:database]
db_adaptor    = Bio::DB::GFF
db_args       = -adaptor DBI::mysql
                -dsn dbi:mysql:TAIR_Arabidopsis_V8
species=Arabidopsis thaliana
assembly.source=TAIR
assembly.version=8
annotation.source=TAIR
annotation.version=8</syntaxhighlight>
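A metadata service of this kind could be quite small.  The sketch below, in Python, shows one way the suggested XML report might be generated from values already held by the browser.  The element names follow the suggestion above; the function name and its arguments are assumptions made for illustration, and no such service exists in GBrowse as described here.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(species, assembly, annotation, reference_sequences, feature_types):
    """Serialize browser metadata using the element names suggested above."""
    root = ET.Element("browser")
    ET.SubElement(root, "species").text = species
    ET.SubElement(root, "assembly").text = assembly
    ET.SubElement(root, "annotation").text = annotation
    for name in reference_sequences:
        ET.SubElement(root, "reference-sequence").text = name
    for name in feature_types:
        ET.SubElement(root, "feature-type").text = name
    return ET.tostring(root, encoding="unicode")


report = metadata_to_xml(
    "Arabidopsis",
    "bayer",
    "1.0",
    ["chr1", "chr2"],
    ["fgenesh:mRNA", "splign:mRNA"],
)
print(report)
```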
 +
 
 +
=== Conclusion / Discussion ===
 +
 
 +
GBrowse 2 is a tool that can be used in a production environment.  It is intensively used within the Bayer Bioinformatics platform to facilitate a high level of data integration.  It is easy to maintain.
 +
 
 +
Our priorities for further developments:
 +
* Adaptors performance
 +
* Need to focus on user interaction
 +
* GBrowse.conf representation
 +
* Native integration of other GMOD tools (e.g. [[CMap]])
 +
 
 +
== [[JBrowse]] ==
 +
[[File:Aug2009Ian.JPG|right|Ian Holmes]]
 +
''Ian Holmes, University of California - Berkeley, [[Media:Aug2009JBrowse.pdf|PDF]]''
 +
 
 +
Some useful links:
 +
* The [[JBrowse]] paper will be published in [http://genome.cshlp.org Genome Research] in the September 2009 issue.  An advance access version is [http://genome.cshlp.org/content/early/2009/08/03/gr.094607.109.abstract available online].
 +
* All things JBrowse are available at [http://jbrowse.org JBrowse.org]
 +
 
 +
JBrowse was initially going to look and feel very much like [[GBrowse]], but with pre-rendered, tiled images, a la Google Maps.  A prototype was built, but this approach did not scale:
 +
 
 +
: ''D. melanogaster'' at pixel resolution is an order of magnitude wider than the continental US.
 +
 
 +
Prerendering also prohibits things like user uploaded data.  The original approach was abandoned and JBrowse now uses JavaScript based client side rendering.  This approach is several orders of magnitude faster to generate the tracks, and takes several orders of magnitude less disk space to store them.
 +
 
 +
JBrowse uses ''[http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/11/1386 nested containment lists (NCList)]'' to store features.  This approach is 5-500 times faster than competing methods such as R-trees, and B-trees with binning.
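The idea behind a nested containment list can be sketched briefly.  The Python below is an illustrative, simplified version, not JBrowse's JavaScript implementation: intervals that are fully contained in another interval are stored as its children, so within each sibling list both start and end coordinates increase and a binary search can find the first overlapping interval.  Intervals are assumed to be half-open <code>(start, end)</code> pairs, and the function names are hypothetical.

```python
def build_nclist(intervals):
    """Build a nested containment list from (start, end) intervals."""
    # Sort by start, and by decreasing end so containers precede their contents.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top = []
    stack = []  # chain of intervals that each contain the next one
    for start, end in ordered:
        node = (start, end, [])
        # Pop intervals that end too early to contain the new one.
        while stack and end > stack[-1][1]:
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top


def query_nclist(nodes, query_start, query_end, found=None):
    """Collect all intervals overlapping [query_start, query_end)."""
    if found is None:
        found = []
    # Binary search for the first sibling whose end lies past query_start.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] <= query_start:
            lo = mid + 1
        else:
            hi = mid
    # Walk forward while siblings still start before the query ends.
    while lo < len(nodes) and nodes[lo][0] < query_end:
        start, end, children = nodes[lo]
        found.append((start, end))
        query_nclist(children, query_start, query_end, found)
        lo += 1
    return found


features = [(0, 100), (10, 20), (15, 18), (30, 40), (120, 130), (90, 125)]
nclist = build_nclist(features)
print(query_nclist(nclist, 16, 35))  # [(0, 100), (10, 20), (15, 18), (30, 40)]
```

Children of an interval that does not overlap the query are never visited, which is where the speed-up over a linear scan comes from.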
 +
 
 +
Ian demonstrated a [http://twiki.org TWiki] plugin for JBrowse that showed an easy way for users to upload their own tracks.
 +
 
 +
Some "imminent" developments for JBrowse:
 +
* Lazily-loaded NCLists
 +
* Text autocompletion; “proper” search
 +
* Nextgen sequence data
 +
** Start with basic summarization, then custom tracks
 +
* Community annotation
 +
** Persistent upload & sharing of tracks
 +
** Editing/curation over the web (ackles...)
 +
* Documented image-track API
 +
* Synteny browser (''c.f.'' [[GBrowse_syn]])
 +
* Much more at jbrowse.lighthouseapp.com
 +
 
 +
Ian closed with a very strong acknowledgment of [[User:MitchSkinner|Mitch Skinner's]] contribution to this work.
 +
 
 +
 
 +
== [[GBrowse_syn]] ==
 +
[[File:Aug2009Sheldon.JPG|right|link=User:Mckays|Sheldon McKay]]
 +
''[[User:Mckays|Sheldon McKay]], [http://cshl.edu Cold Spring Harbor Laboratory (CSHL)], [[Media:Aug2009GBrowse_syn.pdf|PDF]]''
 +
 
 +
A [[synteny]] browser has display elements in common with a genome browser.  Synteny browsers use sequence alignments, orthology or co-linearity data to highlight different genomes, strains, etc., and they usually display co-linearity relative to a reference genome.
 +
 
 +
 
 +
=== Other GMOD Synteny Viewers ===
 +
 
 +
GMOD has several supported [[synteny]] browsers, in addition to [[GBrowse_syn]]:
 +
 
 +
; [[SynView]]
 +
 
 +
SynView is an add-on to the native GBrowse package.  It uses [[GFF3]] or [[DAS]]1 compliant data adapters.  GFF requires special tags (but they are allowed by the spec).  The reference panel appears on the top.
 +
 
 +
; [[SynBrowse]]
 +
 
 +
SynBrowse uses the same core libraries as [[GBrowse]].  It uses the Bio::DB::GFF ([[GFF2]]) adaptor.  The GFF uses standard 'Target' syntax.  It currently supports only two species.
 +
 
 +
; [[Sybil]]
 +
 
 +
Sybil is not [[GBrowse]]-based.  It uses a [[Chado]] database as a backend and provides whole genome and detailed views.
 +
 
 +
; [[CMap]]
 +
 
 +
CMap is a comparative map viewer and can be used to show alignments between markers and regions on any type of map.
 +
 
 +
; [[Apollo]]
 +
 
 +
Apollo (and [[Artemis]] too) provides an embedded synteny viewer.
 +
 
 +
=== GBrowse_syn ===
 +
 
 +
GBrowse_syn is different from the other browsers in a number of ways:
 +
* Does not rely on perfect co-linearity across the entire displayed region (no orphan alignments)
 +
* Offers on the fly alignment chaining
 +
* No upward limit on the number of species
 +
* Uses grid lines to trace fine-scale sequence gain/loss
 +
* Seamless integration with [[GBrowse]] data sources
 +
* Ongoing support and development
 +
* ''Some people think it looks nice''
 +
 
 +
[[GBrowse_syn]] is part of the [[GBrowse]] distribution.  It uses native (GBrowse-compliant) [[GFF2]]/[[GFF3]] or [[Chado]] adapters for individual species' data, and synteny data are stored in a [[GBrowse syn Database|separate joining database]].  The databases form a hub and spoke (or star), with the joining database at the hub, and the individual species databases as the spokes.
 +
 
 +
At run time, GBrowse_syn reads the species databases, the joining/alignment database, and configuration files for each species and an overall config file.
 +
 
 +
=== Where do I get data for GBrowse_syn? ===
 +
 
 +
'''''You''' have to make it.''
 +
 
 +
GBrowse_syn helps you visualize multiple sequence alignment data, but it does not generate it for you.  This is a non-trivial task and is not for the faint of heart.  Sheldon provided a high level overview of one possible process and possible tools you could use in that process.
 +
<div class="quotebox">
 +
<center>
 +
{| cellpadding="5" style="font-size: 140%"
 +
|-
 +
| colspan="4" | Raw genomic sequences
 +
|-
 +
| <span style="font-size: 160%">&dArr;</span>
 +
| Step:
 +
 
 +
ex. tools:
 +
| Mask repeats
 +
 
 +
RepeatMasker, Tandem Repeats Finder, nmerge
 +
|-
 +
| <span style="font-size: 160%">&dArr;</span>
 +
| Step:
 +
 
 +
ex. tools:
 +
| Identify orthologous regions
 +
 
 +
ENREDO, MERCATOR, orthocluster
 +
| <span style="font-size: 160%">&rArr;</span>&nbsp;[[GBrowse_syn]]
 +
|-
 +
| <span style="font-size: 160%">&dArr;</span>
 +
| Step:
 +
 
 +
ex. tools:
 +
| Nucleotide-level alignment
 +
 
 +
PECAN, MAVID
 +
| <span style="font-size: 160%">&rArr;</span>&nbsp;[[GBrowse_syn]]
 +
|-
 +
| <span style="font-size: 160%">&dArr;</span>
 +
| Step:
 +
| Further processing
 +
|-
 +
| colspan="4" | [[GBrowse]]
 +
|}
 +
</center>
 +
</div>
 +
 
 +
Once you have the data, you need to get it into a format that is supported by the GBrowse_syn load scripts.
 +
 
 +
=== Using GBrowse_syn ===
 +
 
 +
GBrowse_syn's user interface looks very much like [[GBrowse]]'s interface.  After selecting a reference assembly, GBrowse_syn displays each aligned sequence as a track, with every other track being the reference assembly.  Aligned regions can be shown with and without connecting ribbons.  Ribbons are ''twisted'' to indicate strand reversal.  Strands can also be reversed in the display to ''untwist'' the ribbons.  Alignment ribbons can be shown with or without embedded grid lines.  Grid lines show a finer level of alignment than plain ribbons, allowing the user to easily identify regions with indels, and to visualize gene structure evolution or gene loss.  They also require nucleotide level alignment.
 +
 
 +
GBrowse_syn can show the same breadth of features as GBrowse.  However, for a clearer display, users are strongly encouraged to limit what they show.  As in GBrowse, arbitrary annotations can be added to any feature and shown with popups or linked pages.
 +
 
 +
GBrowse_syn also provides direct visual feedback on the likely quality of assemblies and can be used for guidance on refining them.  For closely related species, regions in the reference should link to only a few regions in the other sequences.  If a region links to many different regions, the assembly likely needs significant additional work.
 +
 
 +
If all you have is orthology data, GBrowse_syn can show that.  However, the utility of GBrowse_syn declines if the aligned sequences are too far apart.  It does faithfully show the results of the alignment, but the visualization often highlights that the alignments are of poor quality.
 +
 
 +
Finally, if your alignment data has regions aligning to ''multiple'' regions in other species, say because of recent duplications, GBrowse_syn will visualize this correctly.
 +
 
 +
=== Future Developments ===
 +
 
 +
* Integration with GBrowse 2.0
 +
* "On the fly" sequence alignment view
 +
* AJAX-based user interface and navigation
 +
* High-level graphical overviews
 +
 
 +
 
 +
== [http://gbrowse.org GBrowse.org] ==
 +
[[File:Aug2009Alessandra.JPG|right|link=User:Bilardi|Alessandra]]
 +
''[[User:Bilardi|Alessandra Bilardi]], [http://www.cribi.unipd.it/ CRIBI Biotech Center Padua University], [[Media:Aug2009GBrowseOrg.pdf|PDF]]''
 +
 
 +
Alessandra created [http://gbrowse.org GBrowse.org] to facilitate exchange of data, configuration files, and best practices between [[GBrowse]] users.  The web site links to GBrowse instances and data download pages.  It is based on the [http://mediawiki.org MediaWiki wiki package] and makes extensive use of category tags to make information accessible in many different ways.
 +
 
 +
GBrowse.org is updated through a mixture of automated and manual mechanisms.  Entrez' EFetch utility is used to initially create pages with their genome sequencing status.  Each organism's page includes links to browsers, downloads, and sites and pages about that organism.  If information is available on how the sequence and annotation data was produced then that is included as well.
 +
 
 +
GBrowse.org is not limited to just [[GBrowse]] sites.  It also links to Ensembl, UCSC, and several other browser types.
 +
 
 +
Future plans for GBrowse.org include:
 +
* complete automation
 +
* test and edit links
 +
* edit sequencing and annotation methods
 +
* generate GBrowses and pages about all genomes with sequencing completed
 +
* divide GBrowses and genome pages in different sites ([http://gbrowse.org/wiki/index.php/Survey optional])
 +
 
 +
Finally, if you have a GBrowse site, you are encouraged to notify [[User:Bilardi|Alessandra]] for inclusion on [http://gbrowse.org GBrowse.org].
 +
 
 +
 
 +
== [[DAS]] update ==
 +
[[File:Aug2009Jonathan.JPG|right]]
 +
''Jonathan Warren, Sanger Institute, [[Media:Aug2009DASUpdate.ppt|PPT]], [[Media:Aug2009DASUpdate.pdf|PDF]]''
 +
 
 +
Jonathan started with an introduction to [[DAS]].  DAS:
 +
* Stops us from suffering under too much data to manage.
 +
* Allows us to download annotations for regions of interest rather than for whole genomes or databases.
 +
* Allows data providers to stay in control of the annotations they display to the world and to keep them up to date for users.
 +
 
 +
DAS stands for '''D'''istributed '''A'''nnotation '''S'''ystem. It allows data providers to provide their data over the web in a common format. It is based on HTTP and [[Glossary#XML|XML]].  [[Apollo]] and [[GBrowse]], and many other popular packages, can speak DAS.  DAS client programs request a list of DAS sources, and can then request regions of interest from those sources.
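As an illustration of requesting a region of interest, the sketch below builds a DAS 1.x style <code>features</code> request URL in Python.  The server and source names are hypothetical, the helper function is made up for this example, and the URL layout (<code>/das/SOURCE/features?segment=ID:start,stop</code>) is given as the author's understanding of the DAS 1.x convention rather than as a quotation from the specification.

```python
from urllib.parse import urlencode


def das_features_url(server, source, segment_id, start, stop):
    """Build a DAS 1.x style request for features in one region."""
    segment = f"{segment_id}:{start},{stop}"
    # Keep ':' and ',' unescaped so the segment stays readable in the URL.
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical server and source names, for illustration only.
url = das_features_url("http://das.example.org", "my_source", "chr1", 1000, 2000)
print(url)  # http://das.example.org/das/my_source/features?segment=chr1:1000,2000
```

The response to such a request would be an XML document listing the features in that segment, which the client then parses and displays.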
 +
 
 +
=== DAS 1.6E ===
 +
 
 +
DAS has a couple of versions.  DAS was originally published in 2001.  Over the years the DAS standard bifurcated into the DAS 1.x and DAS2 lines.  DAS 1.x has proved more popular than DAS2.  The current standard is 1.53E, but a DAS 1.6E standard came out of a workshop in March 2009.  DAS 1.6E is expected to provide the functionality that many DAS2 users desired.  The 1.6 spec has new features and is a consolidation of the way DAS is being used.  Extensions for 1.6E are being developed.
 +
 
 +
Some DAS 1.5/1.6 Commands: Sources, Features, Sequence, Types, Stylesheet, Structure,
 +
Alignment, and Interaction.
 +
 
 +
Some extensions in DAS 1.6:
 +
* Represent features with more than two levels
 +
* Reliably relate feature types to a more structured ontology.
 +
* Identify when two DAS servers are using the same coordinate system.
 +
* A standard way to create and edit DAS features.
 +
* Verification of DAS servers for standards compliance.
 +
 
 +
=== DAS Registry ===
 +
 
 +
The [http://www.dasregistry.org DAS Registry] is increasing its validation capability for 1.53E and the upcoming 1.6E spec.  A [http://www.dasregistry.org/validation/sources.rng RelaxNG schema] has been created to support this.
 +
 
 +
=== Current and Future Work ===
 +
 
 +
* More validation (headers and feature by id).
 +
* Capability of bulk uploading/mirroring DAS sources to Registry (sources cmd).
 +
** Adding all of Ensembl Genomes (bacteria and viruses) as DAS sources and to the registry.
 +
* Completing the 1.6 spec - hierarchies, nextFeature.
 +
* Updating client libraries and servers to work with both 1.53 and 1.6 spec
 +
* New user interface to the registry for faster searching using Lucene - also limited version available from Sanger and EBI sites.
 +
* Greater support for ontologies, e.g. "give me all DAS sources that provide genes".
 +
 
 +
=== ''Some'' Implementations ===
 +
 
 +
{| class="wikitable"
 +
! DAS Libraries
 +
! DAS Servers
 +
! DAS Clients
 +
|-
 +
|
 +
* PERL
 +
** Proserver, LDAS - servers
 +
** Bio::Das::Lite - client library
 +
* Java
 +
** Dazzle, MyDAS - servers
 +
** Dasobert - client library
 +
|
 +
* Affymetrix
 +
* BioSapiens servers
 +
* Ensembl server
 +
* KEGG DAS
 +
* Sanger DAS server
 +
* EBI Genomic DAS server
 +
* EBI Protein DAS server
 +
* Uniprot DAS server
 +
* TIGR's listing of servers
 +
* UCSC server
 +
|
 +
*  Ensembl
 +
* Spice
 +
* Dasty
 +
* Pfam
 +
* STRAP
 +
* DASher
 +
|}
 +
 
 +
 
 +
== [[InterMine]] update ==
 +
[[File:Aug2009Julie.JPG|right|Julie]]
 +
''Julie Sullivan''
 +
 
 +
Some bullet points from Julie's talk on [[InterMine]]:
 +
 
 +
* InterMine has RESTful web services
 +
* Web service can return HTML.
 +
* [http://flymine.org FlyMine] started in 2002.  5 developers, release about 10 times a year.
 +
 
 +
=== Mines4Mods ===
 +
 
 +
The Mines4Mods project started May 2009.  It is a 2 year grant.  [[:Category:RGD|RGD]], [[:Category:SGD|SGD]], and [http://zfin.org ZFIN] are all participating.  Each has half a developer working on it.  The project is aiming for interoperability between InterMine instances.  The hope is to port results from one InterMine to another, and then use them in a query in the new location.
 +
 
 +
 
 +
== Show and Tell, Discussion ==
 +
 
 +
Daniel Sobral and Baptiste Brault of INRA Versailles demonstrated the Aniseed website, particularly the anatomy and gene expression atlas parts of it.  Aniseed is currently in the process of converting their schema to [[Chado]] and is planning on making their web interface available to the GMOD community.
 +
 
 +
= Agenda Suggestions =
 +
 
 +
If you have items that you would like to discuss (or be discussed) at this meeting, please add them here.
 +
 
 +
* Project update
 +
* [[GMOD REST API]] - Presentation of the current draft spec and a feedback/discussion period. --[[User:Jogoodma|Jogoodma]]
 +
* [[DAS]]: Current Situation and Developments -- JWarren
 +
* [[InterMine]] - update and new project with [[:Category:SGD|SGD]], [[:category:RGD|RGD]] and [http://zfin.org ZFIN] -- Julie Sullivan
 +
* [[Next Generation Sequencing]] - methods for viewing NGS data in [[GBrowse]] -- Steve Taylor
 +
* [[GBrowse]] 1.70 and 2.0 releases
 +
* [[JBrowse]] 1.0 release
 +
* [[GBrowse_syn]] update
 +
* [[Apollo]] plans for the next generation of Apollo
 +
* [[Chado]] 1.1 release
 +
* GMOD for Metagenomics? -- [[User:Clements|Clements]]
 +
* Assembly life cycle management for beginners --[[User:DanBolser|DanBolser]] 14:04, 20 July 2009 (UTC)
 +
* Linked Data for GMOD Databases - making GMOD databases machine-accessible to achieve better data sharing and reuse for bioinformaticians and application developers -- [[User:JunZhao|Jun Zhao]]
 +
* [http://code.google.com/p/gbol GMOD Biological Object Layer] framework
 +
 
 +
= Location =
 +
 
 +
The meeting was held at the [http://www.mstc.ox.ac.uk/ Medical Science Teaching Centre (MSTC)] at the [http://www.ox.ac.uk/ University of Oxford], in [http://www.oxfordcity.co.uk/ Oxford, United Kingdom].
 +
 
 +
= Lodging =
  
 
See the [[GMOD Europe 2009#Lodging|Lodging section]] of the [[GMOD Europe 2009]] page for information on lodging for both the [[2009 GMOD Summer School - Europe|summer school]] and this meeting.
 
  
+
= Cost and Registration =
 +
 
 +
The cost was &pound;50, which included a catered lunch on Friday.  Space was limited to the first 50 people to register.
 +
 
 +
= Mailing List =
 +
 
 +
The meeting has a mailing list that all meeting related correspondence will be sent to:
 +
 
 +
: [mailto:august2009gmodmeeting@gmod.org august2009gmodmeeting@gmod.org]
 +
 
 +
Any meeting participant can send an email to the list.
 +
 
 +
= Sponsor =
 +
 
 +
[[File:Cbrg.jpg|right|300px|link=http://www.molbiol.ox.ac.uk/|CBRG]]
 +
We would like to thank the [http://www.molbiol.ox.ac.uk/ Computational Biology Research Group (CBRG)] of the University of Oxford for hosting and financially supporting the week's events.
 +
 
 +
''I would particularly like to thank Stephen Taylor, Simon McGowan and Zong-Pei Han for their help and support during the entire week of [[GMOD Europe 2009]].  We could not have done this without you. -- [[User:Clements|Dave C.]]''
 +
 
 +
= Attendees =
 +
 
 +
{| class="wikitable sortable"
 +
! First Name
 +
! Last Name
 +
! Affiliation
 +
|-
 +
| Ambrose || Andongabo || Rothamsted Research
 +
|-
 +
| Erick || Antezana || Bayer BioScience NV
 +
|-
 +
| T. Grant || Belgard || MRC FGU
 +
|-
 +
| [[User:Bilardi|Alessandra]] || [[User:Bilardi|Bilardi]] || [http://genomics.cribi.unipd.it/ CRIBI] - University of Padova
 +
|-
 +
| [[User:DanBolser|Dan]] || [[User:DanBolser|Bolser]] || Dundee University
 +
|-
 +
| Baptiste || Brault || INRA Versailles
 +
|-
 +
| Tim || Burgis || Imperial College- London
 +
|-
 +
| [[User:Scott|Scott]] || [[User:Scott|Cain]] || Ontario Institute for Cancer Research
 +
|-
 +
| Maria || Cartolano || University of Oxford
 +
|-
 +
| [[User:Clements|Dave]] || [[User:Clements|Clements]] || [http://nescent.org NESCent]
 +
|-
 +
| Ros || Cutts || Imperial College
 +
|-
 +
| Etienne P || de Villiers || ILRI
 +
|-
 +
| Phil || East  || Cancer Research UK
 +
|-
 +
| Matt || Eldridge || Cancer Research UK- Cambridge Research Institute
 +
|-
 +
| Ben || Elsworth || University of Edinburgh
 +
|-
 +
| [[User:Jogoodma|Josh]] || [[User:Jogoodma|Goodman]] || FlyBase (Indiana University)
 +
|-
 +
| Cyprien || GUERIN || INRA
 +
|-
 +
| Zong-Pei || Han || Computational Biology Research Group, Oxford
 +
|-
 +
| Andreas || Heger || MRC FGU
 +
|-
 +
| Ian || Holmes || UC Berkeley
 +
|-
 +
| Jim || Hughes || MRC
 +
|-
 +
| Bernd || Jagla || Institut Pasteur
 +
|-
 +
| Baptiste || Laporte || IBDML
 +
|-
 +
| [[User:Elee|Ed]] || [[User:Elee|Lee]] || Lawrence Berkeley National Laboratory
 +
|-
 +
| Jacob || Lemieux || Computational Biology Research Group
 +
|-
 +
| Siu-wai || Leung || University of Macau
 +
|-
 +
| Christopher || Love || Rothamsted Research
 +
|-
 +
| Emanuele || Marchi || University of Oxford
 +
|-
 +
| Simon  || McGowan || Computational Biology Research Group, Oxford
 +
|-
 +
| [[User:Mckays|Sheldon]] || [[User:Mckays|McKay]] || Cold Spring Harbor Laboratory
 +
|-
 +
| Frederic || Potier || Bayer BioScience NV
 +
|-
 +
| Peter || Rice || European Bioinformatics Institute
 +
|-
 +
| Kim || Rutherford || University of Cambridge
 +
|-
 +
| Michelle || Simon || Medical Research Council
 +
|-
 +
| Daniel || Sobral || IBDML
 +
|-
 +
| Aengus || Stewart || London Research Institute CRUK
 +
|-
 +
| Julie || Sullivan || InterMine- Dept of Genetics- Cambridge
 +
|-
 +
| Steve || Taylor || Computational Biology Research Group, Oxford
 +
|-
 +
| Adrian || Tivey || Wellcome Trust Sanger Institute
 +
|-
 +
| [[User:Buggy|Giles]] || [[User:Buggy|Velarde]] || Wellcome Trust Sanger Institute
 +
|-
 +
| Pieter Emiel || Ver Loren van Themaat || Max Planck Institute for Plant Breeding Research
 +
|-
 +
| Jonathan || Warren || The Sanger Institute
 +
|-
 +
| Xikun || Wu || Institute for Animal Health
 +
|-
 +
| [[User:JunZhao|Jun]] || [[User:JunZhao|Zhao]] || University of Oxford
 +
|-
 +
| Pinglei || Zhou || Harvard University/FlyBase
 +
|-
 +
|}
 +
 
 +
= Feedback =
 +
 
 +
Attendees were asked to provide feedback at the end of the meeting.
 +
 
 +
 
 +
'''Q: Would you recommend [[Meetings|GMOD meetings]] to others?'''
 +
 
 +
{| class="wikitable"
 +
! Yes
 +
! Maybe
 +
! No
 +
|-
 +
| '''100%'''
 +
| 0%
 +
| 0%
 +
|}
 +
 
 +
 
 +
'''Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).'''
 +
 
 +
{| class="wikitable" style="text-align: right"
 +
!
 +
! 1
 +
! 2
 +
! 3
 +
! 4
 +
! 5
 +
|-
 +
! How useful was the meeting?
 +
| 0%
 +
| 0%
 +
| 23%
 +
! 53%
 +
| 23%
 +
|-
 +
! Was the meeting well run and organized?
 +
| 0%
 +
| 0%
 +
| 18%
 +
! 47%
 +
| 35%
 +
|}
 +
 
 +
 
 +
'''Q: Was the meeting what you expected?'''
 +
 
 +
{| class="wikitable"
 +
! No.
 +
! Yes.
 +
! Yes!
 +
|-
 +
| 0%
 +
! 86%
 +
| 14%
 +
|}
 +
 
 +
Longer responses:
 +
* Yes of course! The meeting was really interesting!
 +
* yes and it was good to for me to meet the developers
 +
* Yes, pretty much. It was in part this time just a good way to meet up with particular collaborators.
 +
* Yes, but I was hoping to learn more about Chado
 +
* Very very useful.
 +
 
 +
 
 +
'''Q: Which [[#Presentations|presentations and sessions]] at this meeting were the most useful or interesting?'''
 +
* [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools]] and updated on GMOD tools also user presentations
 +
* [[#GMOD Biological Object Layer|GMOD Biological Object Layer]], [[#JBrowse|JBrowse]]
 +
* I was really interested in the [[#Visualising NGS Data in GBrowse 2|NGS integration and display on gbrowse 2]], especially the reads representation and population genotypes. Thanks Dave!
 +
* NGS visualization in [[#Visualising NGS Data in GBrowse 2|GBrowse]]/[[#JBrowse|JBrowse]], [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools]]
 +
* [[#Linked Data for GMOD Databases|Linked Data for GMOD Databases]], [[#A DBIx Class layer for Chado|A DBIx Class layer for Chado]], [[#GMOD Biological Object Layer|GMOD Biological Object Layer]], [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools for NextGen Sequence Data]], [[#Visualising NGS Data in GBrowse 2|GBrowse 2, NextGen, PopGen]]
 +
* [[#JBrowse|JBrowse]], [[#GBrowse_syn|GBrowse_syn]]
 +
* Interesting presentations on [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools]], [[#JBrowse|JBrowse]] and [[#GBrowse_syn|GBrowse_syn]]. Web services.  discussions were also interesting
 +
* Chado and GBrowse
 +
* [[#GMOD in the Trenches|GMOD from the Trenches]]; [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools]]; [[#Visualising NGS Data in GBrowse 2|Dave's GBrowse2]]
 +
* [[#A Restful interface for MODs|restful services]], [[#JBrowse|JBrowse]], [[#Linked Data for GMOD Databases|semantic web]]
 +
* [[#Presentations|all]] useful and interesting
 +
* [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|Heng Li]], [[#JBrowse|Ian Holmes]], [[#Linked Data for GMOD Databases|Jun Zhao]]
 +
* GBrowse and [[#JBrowse|JBrowse]] updates
 +
* The talks about GBrowse and the talks about next gen sequencing.
 +
* GBrowse2, [[#JBrowse|JBrowse]], [[#Quest for Standard: Sequence alignment/map format (SAM) and SAMtools|SAMtools]], [[#InterMine update|InterMine]]
 +
* [[#GMOD Biological Object Layer|GMOD Biological Object Layer]], [[#JBrowse|JBrowse]]
 +
 
 +
 
 +
'''Q: Do you have suggestions for improving GMOD meetings in the future?'''
 +
 
 +
* Another one in Europe please. We could host one in Hinxton but I am prepared to travel
 +
* I was able to come to the meeting because it was in Europe, so more meetings in Europe would be very helpful
 +
* Maybe some people can present posters during Coffee Breaks for the next GMOD meeting.
 +
* more sessions
 +
* Less instruction copying, more problem solving
 +
* no
 +
* I do think a informal or formal drinks or meal in the evening is a good idea, even if it's just - 'we are going to this pub to get a meal' which delegates can go to or not and then pay for themselves?
 +
* Better time keeping
 +
* Somewhere drier ;-) Seriously, it didn't seem to have the energy of some of the other 2 I've been to - maybe me or maybe people tired from the [[2009 GMOD Summer School - Europe|course]]
 +
* Try encouraging outsiders to bring non-genomic information to GMOD E.g. people from [http://www.bdgp.org/ BDGP], [http://zfin.org ZFIN expression data], [http://4dx.embl.de/4DXpress/ 4Dxpress], [http://bgee.unil.ch/bgee/bgee BGee], ''etc''...
  
If you have items that you would like to discuss (or have discussed) at this meeting, please add them here.
 
  
* First suggestion, ...
+
'''Additional feedback, suggestions, criticism, and praise.'''
  
== Sponsors ==
+
* This is the first time ever to learn to make use of so many useful bioinformatics tools from the developers and experts of them.
 +
* Thanks for the meeting.
 +
* Thanks very much to the organisers for their hard work - I definitely thought it was worth it
  
{{ImageRight|Cbrg.jpg|CBRG|300|http://www.molbiol.ox.ac.uk/}}
+
[[File:Jan2010MtgLogo170.png|right|link=January 2010 GMOD Meeting|January 2010 GMOD Meeting]]
We would like to thank the [http://www.molbiol.ox.ac.uk/ Computational Biology Research Group (CBRG)] of the University of Oxford for hosting and sponsoring the week's events.
+
= Next Meeting: January 2010 in San Diego California =
  
We would also welcome additional sponsors to help reduce the cost of registration.  Please contact the [mailto:help@gmod.org GMOD Help Desk] if you are interested.
+
The next [[Meetings|GMOD Community Meeting]] was held [[January 2010 GMOD Meeting|January 14-15, 2010 in San Diego, California, United States]], immediately following [[PAG 2010]].
  
 
[[Category:Meetings]]
 
[[Category:Meetings]]
 +
[[Category:BioPerl]]
 +
[[Category:Chado]]
 +
[[Category:Chado Presentations]]
 +
[[Category:Comparative Genomics]]
 +
[[Category:DAS]]
 +
[[Category:GBrowse 2]]
 +
[[Category:GBrowse]]
 +
[[Category:GBrowse syn]]
 +
[[Category:InterMine]]
 +
[[Category:Java]]
 +
[[Category:Middleware]]
 +
[[Category:Presentations]]
 +
[[Category:REST]]
 +
[[Category:Semantic web]]
 +
[[Category:JBrowse]]

Latest revision as of 19:37, 4 September 2013

August 2009 GMOD Meeting
6-7 August, 2009
Oxford UK

Part of GMOD Europe 2009, five days of GMOD including a GMOD Summer School



This GMOD Community Meeting was held 6-7 August, 2009, in Oxford UK. The meeting was a part of GMOD Europe 2009, a week-long event that also included a GMOD Summer School. This was the first time a GMOD meeting had been held in Europe.

As with previous GMOD meetings, this meeting had a mixture of project talks, component talks, and user talks. The agenda was driven by attendee suggestions. The two previous meetings were the January 2009 and July 2008 meetings. GMOD meetings are an excellent way to meet GMOD developers and users, and to learn (and affect) what's coming in the project.


Schedule

Heng Li
Wellcome Trust Sanger Institute


Dr Heng Li of the Sanger Institute was the special guest speaker. Heng discussed his recent work on SAMtools, a set of file formats and scripts for efficiently storing and accessing next generation sequence data. Heng is a developer on several projects focused on next generation sequencing, including SAMtools, BWA, and MAQ.
Date Time Session Link(s)
Thursday
6 August
8:30-12:00 Last half day of 2009 GMOD Summer School - Europe
13:30-14:30 Scott Cain - Introductions and the State of GMOD Prezi, PPT, PDF, Summary
14:30-15:00 Dave Clements - GMOD Help Desk Stuff PPT, PDF, Summary
15:00-15:30 Jun Zhao - Linked Data for GMOD Databases PDF, Summary
15:30-15:45 Coffee Break
15:45-16:15 Steve Taylor - GMOD in the Trenches PDF, Summary
16:15-16:30 Scott Cain (for Robert Buels) - A DBIx::Class layer for Chado S5 Slides, Summary
16:30-17:00 Ed Lee - GMOD Biological Object Layer PDF, Summary
17:00-17:30 Josh Goodman - A Restful interface for MODs Summary
17:30 Dinner (on your own)
Friday
7 August
8:45-9:15 Heng Li - Quest for Standard: Sequence alignment/map format (SAM) and SAMtools PDF, Summary
9:15-9:45 Dave Clements - Visualising NGS Data in GBrowse 2 PPT, PDF, Summary
9:45-10:15 Erick Antezana & Frederic Potier - GBrowse: Lessons Learned and Statement of Interest PDF, Summary
10:15-10:45 Ian Holmes - JBrowse PDF, Summary
10:45-11:00 Coffee Break
11:00-11:30 Sheldon McKay - GBrowse_syn PDF, Summary
11:30-12:30 Discussion: NextGen data and GMOD: What do we do (and not do)?
12:30-13:30 Catered Lunch
13:30-14:00 Alessandra Bilardi - GBrowse.org PDF, Summary
14:00-14:30 Jonathan Warren - DAS update PPT, PDF, Summary
14:30-15:00 Julie Sullivan - InterMine update Summary
15:00-18:00 Show and Tell, Discussion Summary

Presentations

GMOD Project Talks

Scott Cain, Ontario Institute for Cancer Research, Intro, What's New PPT, PDF
Dave Clements, NESCent, Help Desk Update, PPT, PDF

HHMI Science Education Alliance

The Howard Hughes Medical Institute's Science Education Alliance (SEA) is using GMOD tools to teach annotation to college freshmen. They isolate and sequence phage samples. The sequence is then stored in Chado, annotated with Apollo and visualized with GBrowse. The program is in production at 12 colleges across the US.

What's new

Chado (GMOD) 1.1 is coming
  • Minor schema changes
  • Minor fixes to GFF scripts
  • Addition of Chris Mungall's script to create views based on CV terms.
GBrowse
  • GBrowse 2
    • Distributed databases and render servers
    • AJAX track loading
    • Improved configuration management
  • Next Generation Sequencing in GBrowse
    • Support for SAM/BAM databases - see SAMtools
    • Coverage XY-plot, Confidence density plot, Individual alignments, Paired reads
    • Currently Alpha in GBrowse 2; may work in GBrowse 1 with some DAS magic.
  • Circular chromosome support
    • Can scroll through origin, and features can span origin
    • Coming in GBrowse 1.71
    • Developed by Nathan Liles of the EcoliWiki project.
JBrowse
  • Another complete rearchitecture
  • Uses AJAX for client side rendering
GBrowse_syn
  • Distributed with GBrowse 1.70
  • Makes use of data adaptors/databases that GBrowse uses
Tripal
DIYA

DIYA is a gene prediction pipeline for prokaryotes. It complements MAKER, a pipeline for eukaryotes. DIYA is actually a generic, lightweight pipeline framework which was initially built to produce gene predictions. DIYA is becoming part of GMOD.

Atlases and Aniseed

Aniseed is converting its schema to Chado. One of Aniseed's particular strengths is atlases for expression, anatomy, and cell fate. They are extending Chado to better support atlases, and will also make their web front end available as a part of GMOD.

GMOD Summer School

2008

2008 GMOD Summer School - first school ever offered

2009

2009 GMOD Summer School - Americas

  • 4 days at NESCent
  • 8 GMOD Components covered; 9 instructors
  • 52 applications for 25 slots

2009 GMOD Summer School - Europe

  • 3 1/2 days at University of Oxford
  • 7 GMOD Components covered; 10 instructors
  • 58 applications for 25 slots

That's an increase in interest of more than 350% over 2008.

We'll do another summer school at NESCent in 2010. We are also considering one in Asia/Pacific in 2010.

Outreach

GMOD Community Surveys

GMOD is now surveying the community every year. The 2008 GMOD Community Survey had 89 responses and is very informative about how GMOD is used. The 2009 survey will be in October.

Upcoming GMOD Hackathon ?

There may be a GMOD hackathon this coming spring (March to May) at the US National Evolutionary Synthesis Center (NESCent) in Durham, NC, USA. If this happens, the focus will be on extending GMOD for evolutionary biology. Contact Dave if you want to be on the organizing committee or participate.

Linked Data for GMOD Databases

Jun Zhao, Department of Zoology, University of Oxford, PDF

Jun first introduced the Resource Description Framework (RDF) and the SPARQL language for querying it.

OpenFlyData

Jun discussed her group's efforts to build an RDF triple store from several very different data sources: FlyBase (a Chado database), BDGP, FlyTED, FlyAtlas, and Affymetrix data sources. The integrated triple-store can be accessed at OpenFlyData.

They used

  • D2RQ mapping to load FlyBase and BDGP, using conservative mapping with minimum interpretation.
  • OAI2SPARQL to harvest N3 RDF metadata via the OAI-PMH protocol, using built-in support by Eprints, and further info from ESWC2008 paper.
  • Custom Python program to get FlyAtlas data.

Some performance numbers:

  • Loading: Our datasets ~175 million triples
  • Querying:
    • Good enough for real time user interaction, e.g., <1s for single gene search, 1-4s for multigene search (unions)
    • No significant slowdown when scaling from 10m to 175m triples
  • Text matching and case-insensitive search
    • Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
    • As a workaround, pre-generated lower-case gene names and loaded them into the FlyBase RDF DB
    • Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
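
The lower-casing workaround in the list above can be sketched in a few lines of Python. This is only an illustration on an in-memory list of triples, not a real triple store or SPARQL endpoint; the predicate names (`rdfs:label`, `ex:labelLower`) and the gene identifiers are made up for the example.

```python
import re

# Toy triple store: (subject, predicate, object). All identifiers are hypothetical.
triples = [
    ("gene:1", "rdfs:label", "Adh"),
    ("gene:2", "rdfs:label", "adhesion-like"),
    ("gene:3", "rdfs:label", "ADH"),
]

def regex_search(store, name):
    """Case-insensitive search that scans every label, like a SPARQL regex FILTER."""
    pattern = re.compile("^" + re.escape(name) + "$", re.IGNORECASE)
    return sorted(s for s, p, o in store if p == "rdfs:label" and pattern.match(o))

def add_lowercase_labels(store):
    """Pre-generate a lower-cased copy of each label as an extra triple."""
    extra = [(s, "ex:labelLower", o.lower()) for s, p, o in store if p == "rdfs:label"]
    return store + extra

def build_index(store, predicate):
    """Map object -> subjects for one predicate, standing in for the store's own index."""
    index = {}
    for s, p, o in store:
        if p == predicate:
            index.setdefault(o, []).append(s)
    return index

store = add_lowercase_labels(triples)
lower_index = build_index(store, "ex:labelLower")

def indexed_search(name):
    """Exact match on the pre-generated lower-case value; no scan is needed."""
    return sorted(lower_index.get(name.lower(), []))
```

Both functions return the same subjects, but the second turns the case-insensitive match into an exact lookup on a stored value, which is the kind of query a triple store can answer from its index.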

Jun used OpenFlyData to:

  • Search by gene, gene expression mashup
  • Search gene expression by gene batch
  • Search gene expression by tissue expression profile

Open-BioMed

Jun also described a second effort, Open-BioMed, that uses the same technologies to connect knowledge about alternative medicine and western drugs. Open-BioMed demonstrates the value of Linked Data, and shows a novel technique for creating interlinks between datasets on a large scale. This is a joint effort of the BioRDF and LODD (Linked Open Drug Data) task forces of the World Wide Web Consortium (W3C) Health Care Life Science Interest Group. Jun used Open-BioMed to search for herbs associated with a particular disease.

RDF & SPARQL: Benefits & Risks

Some identified benefits:

  • RDF provides a uniform and flexible data model
    • RDF dump is cheaper and quicker
    • Maintaining a separate SPARQL endpoint for each data source makes it easier than a data warehouse approach for handling data updates
  • RDF facilitates data re-use and re-purposing
  • SPARQL raises the point of departure for an application
    • Expressive, open-ended query protocol
    • Support for unanticipated queries

and risks:

  • Mapping data to RDF requires expertise and experience
  • Expressive query protocol is a double-edged sword
  • Performance is good for some queries, not for others ...

GMOD in the Trenches

Stephen Taylor, Computational Biology Research Group, University of Oxford, PDF

The Computational Biology Research Group (CBRG) provides bioinformatics support to researchers at the University of Oxford. They are heavy GMOD users and have used GBrowse, Citrina, BioMart, and Apollo (along with Artemis).

GBrowse at CBRG

Back in 2004, the CBRG wanted to pull data together to make a lab resource, and the genome is a useful data organiser. The CBRG evaluated these platforms: UCSC, Ensembl, AceDB, and GBrowse. Each had advantages and disadvantages, but GBrowse looked like it was built to be distributed and used elsewhere. Ease of installation was not a priority for the others.

The CBRG now supports over 50 different GBrowse databases. Data is mainly human, mouse or bacterial, and data types include time series, arrays, and ChIP-on-Chip. They visualize a lot of Next Generation Sequencing data, including histone modifications, ChIP-Seq, cis/trans interaction data, PCR amplified regions, and RNA-Seq.

The CBRG actively manages data flow to its GBrowse instances. Each production GBrowse instance has a matching development instance where updates and changes are staged and tested before pushing them to production. They also use core and satellite databases. Core databases are built for human and mouse using public source data. To meet individual groups' needs they then clone a core database, load custom data that is specific to that group, and then run a script to merge the core and satellite GBrowse configuration files. They use Apache to restrict access to the satellite instances.

The CBRG strives to encourage power users. Data is available for download, and they have regular meetings to discuss best practices.

Extending GBrowse

In the future they would like to use GBrowse as a workbench. To do this they need flexible ways to import and export features. For example, you can define a temporary track by uploading a GFF3 file, or by connecting via DAS to an outside source, or to another GBrowse. It would be nice to have a method to commit a temporary track and make it permanent. This requires some sort of user authentication.

Steve also walked through an example of how it would be useful to support querying and visualizing data from multiple loci at the same time.

Make Existing GBrowse More Useful to External Developers

Finally Steve listed these 5 ways to make GBrowse more useful to external developers:

  • Document general structure of GBrowse perl modules
  • Tips on debugging
  • Document / define API
  • Central Glyph page
  • Include a copy of BioPerl inside GBrowse

A DBIx Class layer for Chado

Scott Cain, OICR, for Robert Buels, Sol Genomics Network (SGN), S5 Slides

Chado needs middleware, a layer of software between the application (e.g. a website) and the database. Chado's flexible design makes for complex queries and a steep learning curve. It is also hard to get good performance. This talk introduces a Perl DBIx::Class layer for use with Chado, which can be used as the basis for many applications, including the next generation of Modware.

DBIx::Class is an object-relational mapping framework for Perl, and is the de facto standard. It has powerful features for:

  • query building (the magic of chainable ResultSets)
  • cross-database deployment (using SQL::Translator in the backend)
  • testing with Fixtures
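
The idea behind chainable ResultSets can be illustrated outside Perl. The Python sketch below is not DBIx::Class and does not reproduce its API; the `ResultSet` class, its `search` and `as_sql` methods, and the numeric ids are assumptions chosen for illustration. Each call returns a new object, and no SQL is produced until it is asked for.

```python
class ResultSet:
    """A lazy, immutable query description (hypothetical; not the DBIx::Class API)."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def search(self, **where):
        # Return a new ResultSet with the extra conditions; the original is untouched.
        return ResultSet(self.table, self.conditions + tuple(sorted(where.items())))

    def as_sql(self):
        """Render a parameterized SELECT; values are returned separately as bind params."""
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col, _ in self.conditions)
        return sql, [val for _, val in self.conditions]


features = ResultSet("feature")
genes = features.search(type_id=42)        # 42 is a made-up cvterm id
fly_genes = genes.search(organism_id=1)    # narrows 'genes' without changing it
```

Because every step is itself a reusable query, a middleware layer can define something like `genes` once and let applications refine it further, which is one way complex Chado queries could be stored and shared.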

Middleware can help by storing and/or automating complex queries, codifying best practices with both code and unified, high-level documentation. Some performance optimizations can be put in middleware and it can assist in creating indexes and materialized views.


The Bio-Chado-Schema project has been set up by Robert Buels, with source control at GitHub, and releases available on CPAN. This contains DBIx::Class modules for every Chado table that should work with all database platforms that are supported by Chado. The project uses automated tools to keep the modules in sync with changes in the Chado schema. The project is actively looking for development help, and CPAN releases are currently intended for developers. Future goals include API support for common querying and loading patterns, interoperation with BioPerl objects, forming the basis for a future version of Modware, and more.

Rob says:

  • other people should start building features onto and into it
    • and do some of the other things on the slides
  • make a new version of Modware based on it
  • do you think somebody could get funding to work on it full time?

GMOD Biological Object Layer

Ed Lee, BBOP, PDF

Ed has been working with E.O. Stinson and Robert Bruggner at BBOP, and Robin Houston and Adrian Tivey at Sanger to create a Java based biological object layer (GBOL) for genomic features.

GBOL Architecture

GBOL is the top layer of a multilayer architecture:

Biological Object Layer (GBOL)

This layer defines an object at a biological level of interest, say a gene. It aggregates together all of the information about that high level concept into a single, programmatically accessible entity. It hides all of the information about how and where the underlying data is stored.

This layer is inspired by Chado, but is not necessarily built on top of Chado.

Biological Object I/O Layer

This layer ...

Simple Object Layer

This layer knows about basic biological concepts, but does not directly know how or where this information is stored.

Simple Object I/O Layer

This is the bottom layer of the stack and it is closely tied to how and where the data is stored. This layer knows if it is talking to a Chado database, a GFF3 file, or some other data source.

This layer can do simple aggregation such as "return all features in this range", but does not perform aggregation based on biological models. That type of aggregation is performed by higher levels.

Biological Layer Configuration

<?xml version="1.0" encoding="UTF-8"?>
<gbol_mappings>
 <feature_mappings>
  <type cv="SO" term="gene" default="true">
   <read_class>Gene</read_class>
  </type>
  <type cv="SO" term="transcript" default="true">
   <read_class>Transcript</read_class>
  </type>
  <type cv="SO" term="my_transcript">
   <read_class>Transcript</read_class>
  </type>
 </feature_mappings>
 <relationship_mappings>
  <type cv="relationship" term="part_of" default="true">
   <read_class>PartOf</read_class>
  </type>
 </relationship_mappings>
</gbol_mappings>

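A mapping file of this shape is straightforward to read programmatically. GBOL itself is Java and its real loader is not shown in the talk summary, so the Python below is only a sketch of the idea: the `load_mappings` function is hypothetical, the sample is abbreviated, and the meaning of the `default` attribute is not described here, so it is ignored.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the mapping file (XML declaration omitted for simplicity).
MAPPING_XML = """
<gbol_mappings>
 <feature_mappings>
  <type cv="SO" term="gene" default="true">
   <read_class>Gene</read_class>
  </type>
  <type cv="SO" term="my_transcript">
   <read_class>Transcript</read_class>
  </type>
 </feature_mappings>
 <relationship_mappings>
  <type cv="relationship" term="part_of" default="true">
   <read_class>PartOf</read_class>
  </type>
 </relationship_mappings>
</gbol_mappings>
"""

def load_mappings(xml_text):
    """Return {section_name: {(cv, term): read_class}} for each mappings section."""
    root = ET.fromstring(xml_text)
    mappings = {}
    for section in root:
        table = {}
        for type_el in section.findall("type"):
            key = (type_el.get("cv"), type_el.get("term"))
            table[key] = type_el.findtext("read_class")
        mappings[section.tag] = table
    return mappings

mappings = load_mappings(MAPPING_XML)
```

With this lookup, a site-specific term such as `my_transcript` resolves to the same class name as the standard SO `transcript` term, which appears to be the purpose of the third `type` entry in the file.
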
Future Developments

  • Continued development on Biological layer
  • Inference of data: infer introns from exon structure
  • New format handlers: Chado XML, GAME XML, BioPerl bridge
  • Configuration of common relationship variations such as ESTs aligned to the genome directly vs having a "match" feature


A Restful interface for MODs

Josh Goodman, FlyBase

Josh talked about the progress of the GMOD REST API group that was started at the January 2009 GMOD Meeting.


Quest for Standard: Sequence alignment/map format (SAM) and SAMtools

Heng Li, Wellcome Trust Sanger Institute, PDF

Heng spoke about SAM/BAM and SAMtools, a platform agnostic set of file formats and programs for next generation sequence data.

SAM/BAM is a generic nucleotide alignment format that

  • is simple to understand, easy to generate and easy to parse
  • is compact in file size
  • is streamable
  • supports fast random access

Quest for Standards

There had been no standardized and computationally efficient way to store the volumes of data that next generation sequencing produces. Several formats such as phrap ACE and GFF existed, but these were unable to scale up.

SAM Format

The Sequence Alignment / Map (SAM) format is motivated by short read alignment but also works with long reads and de novo assemblies. SAM uses a GFF3-like tab-delimited format with 11 mandatory fields for key information, and variable optional fields and predefined tags for non-standard information. It is designed to be simple to generate and to parse. It uses an extended CIGAR string for various types of alignments. The extended CIGAR string format adds support for clipped, spliced, multi-part, and padded alignments. See the SAM Format Specification for details.
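
To make the field layout concrete, here is a small Python sketch that splits one SAM line into its 11 mandatory fields and reads the extended CIGAR string. The column names follow the current SAM specification (early versions used different names for three of them); the function names and the sample read are invented for illustration, and this is not how SAMtools itself is implemented.

```python
import re

# The 11 mandatory SAM columns, named as in the current specification.
# Early versions of the format called RNEXT/PNEXT/TLEN MRNM/MPOS/ISIZE.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_sam_line(line):
    """Split a tab-delimited alignment line into mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Optional fields have the form TAG:TYPE:VALUE, e.g. NM:i:1
    record["tags"] = {field[:2]: field[5:] for field in cols[11:]}
    return record

def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference; I, S, H, P do not."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")

# A made-up spliced, soft-clipped read with one insertion.
line = "read1\t0\tchr1\t100\t30\t5S20M2I10M100N15M\t*\t0\t0\t*\t*\tNM:i:1"
record = parse_sam_line(line)
span = reference_span(record["CIGAR"])   # 20 + 10 + 100 + 15
```

The `S` (clipping) and `N` (skipped region, used for splicing) operations in the sample are among the additions the extended CIGAR string makes for clipped and spliced alignments.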

BAM Format

SAM is a text format. The Binary Alignment/Map (BAM) format is an exact binary representation of SAM. It has Zlib/gzip compatible compression (and can be decompressed by zlib/gzip). BAM is space efficient, achieving 1 byte per raw base pair, including sequence, quality, read name, position and meta info. BAM is also streamable: programs can process alignments without loading the entire alignment into memory. BAM is usually sorted by the leftmost chromosomal position. BAM is indexed, supports random access, and can quickly retrieve sequences overlapping a specified region.

BAM uses BGZF, a generic indexable compression format. The standard gzip/zlib format is not block-wise, which makes indexing intricate and inefficient. A BGZF file is instead divided into multiple standalone gzip/zlib blocks (64kB each).
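
The block-wise idea can be demonstrated with Python's standard `gzip` module. This is a simplified sketch and not the BGZF format itself: real BGZF records block sizes inside each gzip header, whereas the external `index` list and the function names here are assumptions made for the example.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block, echoing the 64kB figure above

def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as independent gzip members.

    Returns (blob, index) where index[i] is the (offset, length) of
    compressed block i inside blob.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        member = gzip.compress(data[start:start + block_size])
        index.append((len(out), len(member)))
        out += member
    return bytes(out), index

def read_block(blob, index, block_number):
    """Decompress one block without touching the rest of the file."""
    offset, length = index[block_number]
    return gzip.decompress(blob[offset:offset + length])

data = bytes(range(256)) * 1000      # 256,000 bytes of sample data
blob, index = compress_blocks(data)
third = read_block(blob, index, 2)   # random access to the third block only
```

The concatenated blocks still form a valid gzip stream, so ordinary gzip tools can decompress the whole file, while a reader with an index can jump straight to the block it needs.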

BAM indexing uses binning plus a linear index for alignments sorted by the leftmost coordinates. B-trees and pure linear indexes are inefficient for resolving 'overlap' queries. R-trees and pure binning indexes have difficulty with streaming. For short read alignments, retrieving the reads in a region typically takes one seek function call (more efficient than R-trees). The approach also produces small index files (e.g., ~9MB for deep human resequencing).
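
As a rough illustration of the binning half of this scheme, the function below computes the bin for an alignment. It follows the `reg2bin` calculation published in the SAM specification (a hierarchy of bins covering 512Mb down to 16kb); the constants are taken from that specification rather than from the talk, so treat this as a sketch to check against the spec.

```python
def reg2bin(beg, end):
    """Smallest bin that fully contains the zero-based, half-open interval [beg, end).

    Levels cover 16kb, 128kb, 1Mb, 8Mb and 64Mb; bin 0 spans the whole 512Mb range.
    Each tuple is (bit shift for the level, number of the level's first bin).
    """
    end -= 1
    for shift, first_bin in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return first_bin + (beg >> shift)
    return 0

short_read_bin = reg2bin(16384, 16420)   # a 36bp read inside one 16kb window
straddling_bin = reg2bin(16000, 17000)   # crosses a 16kb boundary, so it moves up a level
```

An alignment is filed under the smallest bin that contains it, so a region query only needs to inspect the few bins that can overlap the region instead of every alignment.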

APIs, Implementations and Supported Platforms

Several assembly programs can now produce SAM directly, and SAMtools comes with scripts to convert the output of several other assemblers to SAM format.

SAMtools also has native HTTP/FTP support. Programs can retrieve alignments overlapping a specified region from a remote file via HTTP/FTP. Simply replace the input BAM file name with a URL (http/ftp only). This partial load approach greatly reduces data transfer for applications such as genome browsers, which typically only need small regions of an assembly at any time.

Several implementations using SAMtools are available. The SAMtools package itself includes command line tools and C APIs for:

  • Conversion from other formats
  • SAM ⇔ BAM, indexing, sorting, merging, pileup, SNP/indel calling, alignment viewer ...
  • Native HTTP/FTP support

There are also implementations in Java (Picard and GATK), and Perl (Bio::DB::Sam, which is what GBrowse uses - see the next talk).

Displaying Alignments

An alignment viewer is a great help for method development:

  • Visually understand the alignment: the error rate, the depth, etc.
  • Validate aligner results: even read depth? right coordinates? right gaps?
  • Validate SNP/indel calls: human eyes are always better.
  • Validate structural variations: pair-end information

SAMtools comes with a Text Alignment Viewer, tview, which uses the GNU ncurses library. tview retrieves alignments using FTP/HTTP and is fairly simple. It shows alignments, but not annotation, paired-end information, multiple tracks, ...

The Broad Institute's Java-based Integrative Genomics Viewer (IGV) also works with data in BAM format.

And you can view SAM/BAM in GBrowse using the Bio::DB::Sam Perl adaptor (based on SAMtools C APIs). For SAM/BAM, GBrowse is a lightweight and versatile shared alignment viewer supporting multiple tracks and gene annotations.

For GBrowse, SAM/BAM can provide an efficient way to access large-scale new sequencing data, store various types of alignment (EST, mRNA, etc.) as an alternative to SQL databases, and possibly realize distributed alignment resources. GBrowse already pulls in data from remote sources using DAS. It could be extended to pull in remote SAM/BAM data using FTP/HTTP.

Are distributed alignments feasible? There is already native HTTP/FTP support in SAMtools. This could be added to Bio::DB::Sam as well. Alignment files are compressed. For short reads, one seek call (establishing network connection) is required to get alignments in a region. This would require very little configuration at the server hosting alignments, and compressed data transfer between file servers and the GBrowse server.

There are some major obstacles. The index files have to sit on local disks at the GBrowse server, and matching the reference sequences may be an issue. Bandwidth and caching also have to be addressed.


Visualising NGS Data in GBrowse 2

Dave Clements, NESCent, PPT, PDF

Lincoln Stein has written a GBrowse adaptor, Bio::DB::Sam, for Next Generation Sequencing data stored in the BAM format that Heng Li described in his talk. This is currently in Alpha release, and works only with GBrowse 2. It is available in the gbrowse-adaptors project of GMOD's CVS repository. Short read, next generation sequence data can be directly represented in GFF3, but the amount of data makes it very slow, and requires a very large database to back it. Using Bio::DB::Sam on top of BAM files makes visualizing individual reads both computationally tractable, and manageable.

The talk used an example of 4 E. coli strains: an ancestral strain for which a reference sequence is available, a manipulated strain, and then two strains with phage resistance that evolved from the manipulated strain. Whole genome resequencing was performed on the manipulated and evolved lines. The resequencing was done on an Illumina GA2 and then assembled with the MAQ aligner. The MAQ alignments were then converted to SAM using a SAMtools script, and then to BAM.

Dave then showed how to configure GBrowse to be a short read viewer using Bio::DB::Sam, including an example callback to show alignment quality using color. However, the utility of showing short reads quickly declines as you zoom out past 100-200 bp. You can also use Bio::DB::Sam to show summary statistics such as coverage depth. Dave will work on documenting the Bio::DB::Sam adaptor and its interface to SAMtools in the coming months.
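
For readers unfamiliar with the term, coverage depth is simply the number of reads overlapping each base. The Python sketch below shows one common way to compute it from read intervals; it is an illustration only, not the calculation Bio::DB::Sam performs, and the function name and sample reads are invented.

```python
def coverage_depth(reads, region_start, region_end):
    """Per-base read depth over [region_start, region_end).

    reads is an iterable of half-open (start, end) alignment intervals.
    """
    length = region_end - region_start
    delta = [0] * (length + 1)          # +1 where a read starts, -1 where it ends
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:                     # skip reads outside the region
            delta[lo] += 1
            delta[hi] -= 1
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth

reads = [(0, 5), (2, 8), (4, 6), (20, 30)]
depth = coverage_depth(reads, 0, 10)
```

A summary like this stays readable when zoomed out, which is why it remains useful at scales where drawing individual reads no longer is.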

The talk then showed several other visualizations that can be done with next generation sequence data that don't display the short reads themselves. This included a number of ways to show allele and genotype frequencies (including showing them on a geolocation map).

Finally, if you are planning on starting to use NGS data, make sure you have a lot of bioinformatics infrastructure in place first.


GBrowse: Lessons Learned and Statement of Interest

Erick Antezana & Frederic Potier, Bayer CropScience, PDF

History and Current GBrowse Infrastructure

Bayer CropScience uses GBrowse 1.70 and GBrowse 2, CMap, Galaxy, and Ergatis. They have been a GBrowse user since 2004. They also evaluated Chado and chose not to use it because of performance issues. They currently use GBrowse 2 and mainly Bio::DB::GFF databases, focused mainly on plants. They work with publicly available plant genomes, private genomes, and increasingly frequent annotation updates. Their requirements include minor data reformatting, fast data loading and querying, a customizable application, and a high level of integrity.

Bayer currently has more than 30 databases with public data, at around 30GB. Their in house data includes next generation sequence data (stored in BAM and accessed in GBrowse 2 via Bio::DB::Sam), genome annotation (stored in a Bio::DB::GFF database), and molecular mapping visualized with CMap. They are also considering supporting user annotation / manual curation with Apollo and/or Artemis. Their automated annotation workflow produces GFF and generates GBrowse configuration files.

Bayer has extended GBrowse in several ways, including user authentication, permissions, and tracking.

Also

  • On the fly visualization
  • Blast anchoring/Sequence homology search
    • blast homologies are uploaded as user annotations
  • Plugins
    • data export
    • links to in house applications
  • In house keyword search engine
    • fast search utility
    • cross databases search
  • Gateway
    • centralised access point

Statement of Interest: Requirements and Needs

Bayer CropScience would also like to see GMOD extended in a number of areas.

GBrowse Database Adaptors

  • NGS adaptor (Bio::DB::Sam) is a key priority
  • Memory adaptor: would like to be able to specify a file name or a complete path via a parameter, so that the adaptor doesn't need to load all the GFF files in the directory
  • Chado adaptor: portability to Oracle; ability to store user-specific annotation / manual curation; a system to track versions and history of the annotations; and management of user access rights
  • SeqFeature::Store: portability to Oracle (c.f. user access rights via VPD) and faster loading time
  • Compatibility with other genome browsers' databases, for instance Ensembl databases?

GBrowse User Interaction

  • Authentication
    • To track user sessions
    • To enable user access rights management
  • User Annotation Management
    • To store the user annotations in a database or in a file on the server, so that users can retrieve their annotations when connecting from different machines
    • To automatically send a user's annotations to GBrowse via a URL parameter
  • Integration with CMap

GBrowse Configuration Files

The current format is error prone, difficult to debug, has a steep learning curve, and is time consuming to maintain. Bayer (and CBRG and modENCODE and ...) partially work around this by having scripts generate their configuration files.

A better solution would be to have a better representation of the configuration file, XML for instance. (JBrowse addresses this issue by using JSON for its configuration files - Dave)

Would also like the ability to configure the global layout to enable or disable components, for example to disable the custom tracks or display settings components.

Would also like to have a standardized way to specify metadata in the configuration files. For example, species and assembly versions:

#################################
# database definitions
#################################
 
[TAIR_Arabidopsis_V8:database]
db_adaptor         = Bio::DB::GFF
db_args            = -adaptor DBI::mysql
                     -dsn dbi:mysql:TAIR_Arabidopsis_V8
species            = Arabidopsis thaliana
assembly.source    = TAIR
assembly.version   = 8
annotation.source  = TAIR
annotation.version = 8

Metadata Web Services

Web services could be used to query and report on metadata such as the list of reference sequences, annotation version, assembly version, and list of available feature types.

Suggestion:

<browser>
  <species>Arabidopsis</species>
  <assembly>bayer</assembly>
  <annotation>1.0</annotation>
  <reference-sequence>chr1</reference-sequence>
  <reference-sequence>chr2</reference-sequence>
  <feature-type>fgenesh:mRNA</feature-type>
  <feature-type>splign:mRNA</feature-type>
</browser>

This information could be defined in the config file:

[TAIR_Arabidopsis_V8:database]
db_adaptor    = Bio::DB::GFF
db_args       = -adaptor DBI::mysql
                -dsn dbi:mysql:TAIR_Arabidopsis_V8
species=Arabidopsis thaliana
assembly.source=TAIR
assembly.version=8
annotation.source=TAIR
annotation.version=8
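
As a rough sketch of how the two suggestions fit together, the Python below reads the metadata keys from a stanza like the one above and emits a `<browser>` document in the suggested shape. Python's `configparser` is only a stand-in for GBrowse's own configuration parser, the `metadata_xml` function is hypothetical, and the choice of which key feeds which element (source for `<assembly>`, version for `<annotation>`) is a guess based on the sample values.

```python
import configparser
import xml.etree.ElementTree as ET

CONF = """\
[TAIR_Arabidopsis_V8:database]
db_adaptor    = Bio::DB::GFF
db_args       = -adaptor DBI::mysql
                -dsn dbi:mysql:TAIR_Arabidopsis_V8
species=Arabidopsis thaliana
assembly.source=TAIR
assembly.version=8
annotation.source=TAIR
annotation.version=8
"""

def metadata_xml(conf_text, section):
    """Build the suggested <browser> metadata document from one database stanza."""
    # Only '=' separates keys from values, so '::' in adaptor names is left alone.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    db = parser[section]
    browser = ET.Element("browser")
    ET.SubElement(browser, "species").text = db["species"]
    ET.SubElement(browser, "assembly").text = db["assembly.source"]
    ET.SubElement(browser, "annotation").text = db["annotation.version"]
    return ET.tostring(browser, encoding="unicode")

xml_out = metadata_xml(CONF, "TAIR_Arabidopsis_V8:database")
```

Reference sequences and feature types would have to come from the database itself rather than the configuration file, so they are left out of this sketch.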

Conclusion / Discussion

GBrowse 2 is a tool that can be used in a production environment. It is intensively used within the Bayer Bioinformatics platform to facilitate a high level of data integration. It is easy to maintain.

Our priorities for further developments:

  • Adaptors performance
  • Need to focus on user interaction
  • GBrowse.conf representation
  • Native integration of other GMOD tools (e.g. CMap)

JBrowse

Ian Holmes, University of California - Berkeley, PDF

JBrowse was initially going to look and feel very much like GBrowse, but with pre-rendered, tiled images, a la Google Maps. A prototype was built, but this approach did not scale:

D. melanogaster at pixel resolution is an order of magnitude wider than the continental US.

Prerendering also prohibits things like user uploaded data. The original approach was abandoned and JBrowse now uses JavaScript based client side rendering. This approach is several orders of magnitude faster to generate the tracks, and takes several orders of magnitude less disk space to store them.

JBrowse uses nested containment lists (NCList) to store features. This approach is 5-500 times faster than competing methods such as R-trees, and B-trees with binning.
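
The structure can be sketched compactly in Python. This is an illustrative in-memory version of a nested containment list, not JBrowse's JavaScript implementation or its on-disk format; the function names and sample features are invented, and intervals are treated as half-open.

```python
def build_nclist(intervals):
    """Build nested lists of [start, end, children] from half-open (start, end) pairs."""
    top, stack = [], []
    # Sort by start, longest first on ties, so containers precede what they contain.
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = [start, end, []]
        # Drop open intervals that end too early to contain this one.
        while stack and stack[-1][1] < end:
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top


def query_nclist(nodes, qstart, qend, hits=None):
    """Collect every interval overlapping [qstart, qend)."""
    if hits is None:
        hits = []
    # Siblings never contain one another, so their ends increase with their starts;
    # binary search for the first sibling that ends after qstart.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] <= qstart:
            lo = mid + 1
        else:
            hi = mid
    while lo < len(nodes) and nodes[lo][0] < qend:
        start, end, children = nodes[lo]
        hits.append((start, end))
        query_nclist(children, qstart, qend, hits)
        lo += 1
    return hits


features = [(0, 100), (10, 20), (15, 18), (30, 40), (50, 120), (60, 70), (200, 300)]
nclist = build_nclist(features)
overlapping = query_nclist(nclist, 16, 35)
```

Features contained in another feature are pushed into that feature's sublist, so each list is sorted by both start and end. A query is then a binary search followed by a short scan at each level, and sublists are only visited when their parent overlaps the query.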

Ian demonstrated a TWiki plugin for JBrowse that showed an easy way for users to upload their own tracks.

Some "imminent" developments for JBrowse:

  • Lazily-loaded NCLists
  • Text autocompletion; “proper” search
  • Nextgen sequence data
    • Start with basic summarization, then custom tracks
  • Community annotation
    • Persistent upload & sharing of tracks
    • Editing/curation over the web (ackles...)
  • Documented image-track API
  • Synteny browser (c.f. GBrowse_syn)
  • Much more at jbrowse.lighthouseapp.com

Ian closed with a very strong acknowledgment of Mitch Skinner's contribution to this work.


GBrowse_syn

Sheldon McKay, Cold Spring Harbor Laboratory (CSHL), PDF

Synteny browsers have display elements in common with genome browsers. They use sequence alignments, orthology or co-linearity data to highlight different genomes, strains, etc., and they usually display co-linearity relative to a reference genome.


Other GMOD Synteny Viewers

GMOD has several supported synteny browsers, in addition to GBrowse_syn:

SynView

SynView is an add-on to the native GBrowse package. It uses GFF3 or DAS1 compliant data adapters. The GFF requires special tags (but they are allowed by the spec). The reference panel appears at the top.

SynBrowse

SynBrowse uses the same core libraries as GBrowse. It uses the Bio::DB::GFF (GFF2) adaptor. The GFF uses standard 'Target' syntax. It currently supports only two species.

Sybil

Sybil is not GBrowse-based. It uses a Chado database as a backend and provides whole genome and detailed views.

CMap

CMap is a comparative map viewer and can be used to show alignments between markers and regions on any type of map.

Apollo

Apollo (and Artemis too) provides an embedded synteny viewer.

GBrowse_syn

GBrowse_syn is different from the other browsers in a number of ways:

  • Does not rely on perfect co-linearity across the entire displayed region (no orphan alignments)
  • Offers on the fly alignment chaining
  • No upward limit on the number of species
  • Uses grid lines to trace fine-scale sequence gain/loss
  • Seamless integration with GBrowse data sources
  • Ongoing support and development
  • Some people think it looks nice

GBrowse_syn is part of the GBrowse distribution. It uses native (GBrowse-compliant) GFF2/GFF3 or Chado adapters for individual species' data, and synteny data are stored in a separate joining database. The databases form a hub and spoke (or star), with the joining database at the hub, and the individual species databases as the spokes.

At run time, GBrowse_syn reads the species databases, the joining/alignment database, a configuration file for each species, and an overall configuration file.
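The hub-and-spoke layout can be sketched in a few lines of Python. Everything here is hypothetical and chosen for illustration only: the `alignments` table and its columns, the species names, and the in-memory stand-ins for the per-species databases are not taken from the GBrowse_syn schema. The point is only that the hub records which regions align, while each spoke answers questions about its own species.

```python
import sqlite3

# Spokes: one annotation store per species (stand-ins for separate GBrowse databases).
species_dbs = {
    "speciesA": [("chr1", 100, 900, "geneA1")],
    "speciesB": [("chr2", 5000, 5800, "geneB1"), ("chr2", 9000, 9500, "geneB2")],
}

# Hub: a joining database that only records which regions align to which.
# Table and column names are hypothetical.
hub = sqlite3.connect(":memory:")
hub.execute(
    "CREATE TABLE alignments ("
    "src1 TEXT, ref1 TEXT, start1 INTEGER, end1 INTEGER, "
    "src2 TEXT, ref2 TEXT, start2 INTEGER, end2 INTEGER, strand2 TEXT)"
)
hub.execute(
    "INSERT INTO alignments VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("speciesA", "chr1", 50, 1000, "speciesB", "chr2", 4900, 5900, "-"),
)


def aligned_regions(species, ref, start, end):
    """Ask the hub which regions in other species align to the given region."""
    return hub.execute(
        "SELECT src2, ref2, start2, end2, strand2 FROM alignments "
        "WHERE src1 = ? AND ref1 = ? AND start1 < ? AND end1 > ?",
        (species, ref, end, start),
    ).fetchall()


def features_in(species, ref, start, end):
    """Ask one spoke for the features overlapping a region."""
    return [name for r, s, e, name in species_dbs[species]
            if r == ref and s < end and e > start]


# Look up the aligned region in the hub, then annotate it from the matching spoke.
for src, ref, start, end, strand in aligned_regions("speciesA", "chr1", 100, 900):
    print(src, ref, start, end, strand, features_in(src, ref, start, end))
```

Because the spokes never reference each other, adding another species in this sketch means adding one more spoke and more rows in the hub, without touching the existing species data.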

Where do I get data for GBrowse_syn?

You have to make it.

GBrowse_syn helps you visualize multiple sequence alignment data, but it does not generate it for you. Generating it is a non-trivial task and is not for the faint of heart. Sheldon provided a high-level overview of one possible process and the tools you could use at each step.

The process starts from raw genomic sequences:

  • Step: Mask repeats
    • Example tools: RepeatMasker, Tandem Repeats Finder, nmerge
  • Step: Identify orthologous regions
    • Example tools: ENREDO, MERCATOR, orthocluster
    • → GBrowse_syn
  • Step: Nucleotide-level alignment
    • Example tools: PECAN, MAVID
    • → GBrowse_syn
  • Step: Further processing
    • → GBrowse

Once you have the data, you need to get it into a format that is supported by the GBrowse_syn load scripts.

Using GBrowse_syn

GBrowse_syn's user interface looks very much like GBrowse's interface. After selecting a reference assembly, GBrowse_syn displays each aligned sequence as a track, with every other track being the reference assembly. Aligned regions can be shown with and without connecting ribbons. Ribbons are twisted to indicate strand reversal. Strands can also be reversed in the display to untwist the ribbons. Alignment ribbons can be shown with or without embedded grid lines. Grid lines show a finer level of alignment than plain ribbons, allowing the user to easily identify regions with indels, and to visualize gene structure evolution or gene loss. They also require nucleotide level alignment.

GBrowse_syn can show the same breadth of features as GBrowse. However, for a clearer display, users are strongly encouraged to limit what they show. As in GBrowse, arbitrary annotations can be added to any feature and shown with popups or linked pages.

GBrowse_syn also provides direct visual feedback on the likely quality of assemblies and can be used for guidance on refining them. For closely related species, a region in the reference should link to only a few regions in the other sequences. If it links to many different regions, the assembly likely needs significant additional work.
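The same check can be approximated numerically before looking at the display. The sketch below is not part of GBrowse_syn; the record layout, the window size, and the `max_targets` threshold are all assumptions for illustration. It counts how many distinct target sequences each window of the reference links to, and flags windows that fan out to many.

```python
from collections import defaultdict


def link_fanout(alignments, window=10_000):
    """Count distinct target sequences linked from each reference window.

    alignments: iterable of (ref_seq, ref_start, target_species, target_seq).
    Returns {(ref_seq, window_index, target_species): number_of_target_seqs}.
    """
    targets = defaultdict(set)
    for ref_seq, ref_start, species, target_seq in alignments:
        targets[(ref_seq, ref_start // window, species)].add(target_seq)
    return {key: len(seqs) for key, seqs in targets.items()}


def suspicious_windows(fanout, max_targets=2):
    """Return windows linking to more target sequences than the (arbitrary) threshold."""
    return sorted(key for key, n in fanout.items() if n > max_targets)


# Hypothetical alignment records: the second 10 kb window of chr1 scatters
# across three different scaffolds in species B.
alignments = [
    ("chr1", 100, "B", "scaf1"),
    ("chr1", 5_000, "B", "scaf1"),
    ("chr1", 12_000, "B", "scaf2"),
    ("chr1", 13_000, "B", "scaf7"),
    ("chr1", 14_000, "B", "scaf9"),
]
fanout = link_fanout(alignments)
print(suspicious_windows(fanout))
```

A sensible threshold depends on how fragmented the assemblies are and how diverged the species are, so the value used here is only a placeholder.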

If all you have is orthology data, GBrowse_syn can show that. However, the utility of GBrowse_syn declines if the aligned sequences are too far apart. It does faithfully show the results of the alignment, but the visualization often highlights that the alignments are of poor quality.

Finally, if your alignment data has regions aligning to multiple regions in other species, say because of recent duplications, GBrowse_syn will visualize this correctly.

Future Developments

  • Integration with GBrowse 2.0
  • "On the fly" sequence alignment view
  • AJAX-based user interface and navigation
  • High-level graphical overviews


GBrowse.org

Alessandra Bilardi, CRIBI Biotech Center Padua University, PDF

Alessandra created GBrowse.org to facilitate exchange of data, configuration files, and best practices between GBrowse users. The web site links to GBrowse instances and data download pages. It is based on the MediaWiki wiki package and makes extensive use of category tags to make information accessible in many different ways.

GBrowse.org is updated through a mixture of automated and manual mechanisms. Entrez's EFetch utility is used to initially create pages with their genome sequencing status. Each organism's page includes links to browsers, downloads, and sites and pages about that organism. If information is available on how the sequence and annotation data were produced, that is included as well.

GBrowse.org is not limited to just GBrowse sites. It also links to Ensembl, UCSC, and several other browser types.

Future plans for GBrowse.org include:

  • complete automations
  • test and edit links
  • edit sequencing and annotation methods
  • generate GBrowses and pages about all genomes with sequencing completed
  • divide GBrowses and genome pages in different sites (optional)

Finally, if you have a GBrowse site, you are encouraged to notify Alessandra for inclusion on GBrowse.org.


DAS update

Jonathan Warren, Sanger Institute, PPT, PDF

Jonathan started with an introduction to DAS. DAS:

  • Stops us from suffering under too much data to manage.
  • Allows us to download annotations for regions of interest rather than for whole genomes or databases.
  • Allows data providers to control how their annotations are displayed to the world and to keep them up to date for users.

DAS stands for Distributed Annotation System. It allows data providers to provide their data over the web in a common format. It is based on HTTP and XML. Apollo and GBrowse, and many other popular packages, can speak DAS. DAS client programs request a list of DAS sources, and can then request regions of interest from those sources.
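As an illustration of the request/response pattern, the Python sketch below builds a DAS 1.x `features` request for a region and parses a DASGFF-style response. The server name `das.example.org` and the source name `mygenome` are hypothetical, and the XML is a hand-written sample in the general shape of a DAS 1.x features response rather than output from a real server. No network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, segment, start, stop):
    """Build a DAS 1.x features request for one region of one data source."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


def parse_features(xml_text):
    """Extract (segment, feature id, type, start, end, orientation) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            rows.append((
                segment.get("id"),
                feature.get("id"),
                feature.find("TYPE").get("id"),
                int(feature.findtext("START")),
                int(feature.findtext("END")),
                feature.findtext("ORIENTATION"),
            ))
    return rows


# Hand-written sample response (an assumption for illustration, not real server output).
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/mygenome/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="f1" label="geneX">
        <TYPE id="gene">gene</TYPE>
        <METHOD id="curated">curated</METHOD>
        <START>1200</START>
        <END>1800</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""

print(features_url("http://das.example.org", "mygenome", "1", 1000, 2000))
print(parse_features(SAMPLE_RESPONSE))
```

A real client would first use the sources command to discover which data sources a server offers, then issue requests like this one only for the regions the user is viewing.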

DAS 1.6E

DAS has a couple of versions. DAS was originally published in 2001. Over the years the DAS standard bifurcated into the DAS 1.x and DAS2 lines, and DAS 1.x has proved more popular than DAS2. The current standard is 1.53E, but a DAS 1.6E standard came out of a workshop in March 2009. DAS 1.6E is expected to provide the functionality that many DAS2 users desired. The 1.6 spec adds new features and consolidates the way DAS is being used. Extensions for 1.6E are still being developed.

Some DAS 1.5/1.6 commands: Sources, Features, Sequence, Types, Stylesheet, Structure, Alignment, and Interaction.

Some extensions in DAS 1.6:

  • Represent features with more than two levels
  • Reliably relate feature types to a more structured ontology.
  • Identify when two DAS servers are using the same coordinate system.
  • A standard way to create and edit DAS features.
  • Verification of DAS servers for standards compliance.

DAS Registry

The DAS Registry is increasing its validation capability for the 1.53E spec and the upcoming 1.6E spec. A RelaxNG schema has been created to support this.

Current and Future Work

  • More validation (headers and feature by id).
  • Capability of bulk uploading/mirroring DAS sources to Registry (sources cmd).
    • Adding all of Ensembl Genomes (bacteria and viruses) as DAS sources and to the registry.
  • Completing the 1.6 spec - hierarchies, nextFeature.
  • Updating client libraries and servers to work with both 1.53 and 1.6 spec
  • New user interface to the registry for faster searching using Lucene - also limited version available from Sanger and EBI sites.
  • Greater support for ontologies, e.g. "give me all DAS sources that provide genes"

Some Implementations

DAS Libraries

  • Perl
    • Proserver, LDAS - servers
    • Bio::Das::Lite - client library
  • Java
    • Dazzle, MyDAS - servers
    • Dasobert - client library

DAS Servers

  • Affymetrix
  • BioSapiens servers
  • Ensembl server
  • KEGG DAS
  • Sanger DAS server
  • EBI Genomic DAS server
  • EBI Protein DAS server
  • Uniprot DAS server
  • TIGR's listing of servers
  • UCSC server

DAS Clients

  • Ensembl
  • Spice
  • Dasty
  • Pfam
  • STRAP
  • DASher


InterMine update

Julie Sullivan

Some bullet points from Julie's talk on InterMine:

  • InterMine has RESTful web services
  • Web service can return HTML.
  • FlyMine started in 2002. It has 5 developers and releases about 10 times a year.

Mines4Mods

The Mines4Mods project started in May 2009. It is a 2-year grant. RGD, SGD, and ZFIN are all participating, and each has half a developer working on it. The project is aiming for interoperability between InterMine instances. The hope is to port results from one InterMine to another, and then use them in a query in the new location.


Show and Tell, Discussion

Daniel Sobral and Baptiste Brault of INRA Versailles demonstrated the Aniseed website, particularly the anatomy and gene expression atlas parts of it. Aniseed is currently converting its schema to Chado and plans to make its web interface available to the GMOD community.

Agenda Suggestions

If you have items that you would like to discuss (or be discussed) at this meeting, please add them here.

Location

The meeting was held at the Medical Science Teaching Centre (MSTC) at the University of Oxford, in Oxford, United Kingdom.

Lodging

See the Lodging section of the GMOD Europe 2009 page for information on lodging for both the summer school and this meeting.

Cost and Registration

The cost was £50, which included a catered lunch on Friday. Space was limited to the first 50 people to register.

Mailing List

The meeting has a mailing list that all meeting related correspondence will be sent to:

august2009gmodmeeting@gmod.org

Any meeting participant can send an email to the list.

CBRG

We would like to thank the Computational Biology Research Group (CBRG) of the University of Oxford for hosting and financially supporting the week's events.

I would particularly like to thank Stephen Taylor, Simon McGowan and Zong-Pei Han for their help and support during the entire week of GMOD Europe 2009. We could not have done this without you. -- Dave C.

Attendees

First Name Last Name Affiliation
Ambrose Andongabo Rothamsted Research
ERICK ANTEZANA BAYER BIOSCIENCE NV
T. Grant Belgard MRC FGU
Alessandra Bilardi CRIBI - University of Padova
Dan Bolser Dundee University
Baptiste Brault INRA Versailles
Tim Burgis Imperial College- London
Scott Cain Ontario Institute for Cancer Research
Maria Cartolano University of Oxford
Dave Clements NESCent
Ros Cutts Imperial College
Etienne P de Villiers ILRI
Phil East Cancer Research UK
Matt Eldridge Cancer Research UK- Cambridge Research Institute
Ben Elsworth University of Edinburgh
Josh Goodman FlyBase (Indiana University)
Cyprien GUERIN INRA
Zong-Pei Han Computational Biology Research Group, Oxford
Andreas Heger MRC FGU
Ian Holmes UC Berkeley
Jim Hughes MRC
Bernd Jagla Institut Pasteur
Baptiste Laporte IBDML
Ed Lee Lawrence Berkeley National Laboratory
Jacob Lemieux Computational Biology Research Group
Siu-wai Leung University of Macau
Christopher Love Rothamsted Research
Emanuele Marchi University of Oxford
Simon McGowan Computational Biology Research Group, Oxford
Sheldon McKay Cold Spring Harbor Laboratory
FREDERIC POTIER BAYER BIOSCIENCE NV
Peter Rice European Bioinformatics Institute
Kim Rutherford University of Cambridge
Michelle Simon Medical Research Council
Daniel Sobral IBDML
Aengus Stewart London Research Institute CRUK
Julie Sullivan InterMine- Dept of Genetics- Cambridge
Steve Taylor Computational Biology Research Group, Oxford
Adrian Tivey Wellcome Trust Sanger Institute
Giles Velarde Wellcome Trust Sanger Institute
Pieter Emiel Ver Loren van Themaat Max Planck Institute for Plant Breeding Research
Jonathan Warren The Sanger Institute
Xikun Wu Institute for Animal Health
Jun Zhao University of Oxford
Pinglei Zhou Harvard University/FlyBase

Feedback

Attendees were asked to provide feedback at the end of the meeting.


Q: Would you recommend GMOD meetings to others?

Yes: 100%, Maybe: 0%, No: 0%


Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).

  • How useful was the meeting? 1: 0%, 2: 0%, 3: 23%, 4: 53%, 5: 23%
  • Was the meeting well run and organized? 1: 0%, 2: 0%, 3: 18%, 4: 47%, 5: 35%


Q: Was the meeting what you expected?

"No.": 0%, "Yes.": 86%, "Yes!": 14%

Longer responses:

  • Yes of course! The meeting was really interesting!
  • yes and it was good to for me to meet the developers
  • Yes, pretty much. It was in part this time just a good way to meet up with particular collaborators.
  • Yes, but I was hoping to learn more about Chado
  • Very very useful.


Q: Which presentations and sessions at this meeting were the most useful or interesting?


Q: Do you have suggestions for improving GMOD meetings in the future?

  • Another one in Europe please. We could host one in Hinxton but I am prepared to travel
  • I was able to come to the meeting because it was in Europe, so more meetings in Europe would be very helpful
  • Maybe some people can present posters during Coffee Breaks for the next GMOD meeting.
  • more sessions
  • Less instruction copying, more problem solving
  • no
  • I do think a informal or formal drinks or meal in the evening is a good idea, even if it's just - 'we are going to this pub to get a meal' which delegates can go to or not and then pay for themselves?
  • Better time keeping
  • Somewhere drier ;-) Seriously, it didn't seem to have the energy of some of the other 2 I've been to - maybe me or maybe people tired from the course
  • Try encouraging outsiders to bring non-genomic information to GMOD E.g. people from BDGP, ZFIN expression data, 4Dxpress, BGee, etc...


Additional feedback, suggestions, criticism, and praise.

  • This is the first time ever to learn to make use of so many useful bioinformatics tools from the developers and experts of them.
  • Thanks for the meeting.
  • Thanks very much to the organisers for their hard work - I definitely thought it was worth it

Next Meeting: January 2010 in San Diego, California

The next GMOD Community Meeting was held January 14-15, 2010 in San Diego, California, United States, immediately following PAG 2010.