Difference between revisions of "Overview"

From GMOD
m (A Browser for a Stock Collection)
m (What is Tripal?: Adding detail to the tripal listing)
 
(8 intermediate revisions by 3 users not shown)
Line 1: Line 1:
''.. formerly titled "GMOD for the Biologist".
+
''... formerly titled "GMOD for the Biologist".''
  
This page provides an overview of the GMOD project. It does not assume any particular background in computing.
+
This page provides an overview of the GMOD project. It does not assume any particular background in computing.
  
 
==Introduction==
 
==Introduction==
  
With the amount of technical documentation available for GMOD the casual observer would be forgiven if they concluded that GMOD was a project about software. But it's not, GMOD has been created ''for biologists'' and in the real world it's used ''by biologists''. However, the creators of GMOD are mostly not practicing biologists and the look and the feel of most GMOD documentation reflects this. What we will attempt to do is discuss GMOD from the researchers' perspective. This does not simply mean describe what the software does. If you look, for example, at a typical [[GBrowse]] page like this [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2 GBrowse view of human chromosome 7] you'll understand immediately what GBrowse is built to do, and a few more minutes of clicking and scrolling will reveal all sorts of useful ways to display and query the data. A modern biologist knows a great deal about bioinformatics functionality already. What we're more concerned with here are the practical details. Like ''given the data I have what database should I use? or do I even need a database?'' Or ''how hard is this going to be?''
+
With the amount of technical documentation available for GMOD, the casual observer would be forgiven if they concluded that GMOD was a project about software. But it's not: GMOD has been created ''for biologists'' and in the real world, it is used ''by biologists''. However, the creators of GMOD are mostly not practicing biologists and the look and feel of most GMOD documentation reflects this. What we will attempt to do is discuss GMOD from the researchers' perspective. This does not simply mean describing what the software does. If you look, for example, at a typical [[GBrowse]] page (e.g. [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2/ GBrowse view of human chromosome 7]), you'll understand immediately what GBrowse is built to do, and a few more minutes of clicking and scrolling will reveal all sorts of useful ways to display and query the data. A modern biologist knows a great deal about bioinformatics functionality already. This introduction is more concerned with answering practical questions, such as ''given the data I have, what database should I use?'', ''do I even need a database?'', or ''how hard is this going to be?''
  
In our experience we find that most biologists want to focus on the science. They may have little knowledge of programming languages or databases, and only passing interest in the IT minutiae. They have deep knowledge of their own data, needless to say, and know how data like their own can be viewed and analyzed. What they want to know is how to create their own useful set of tools for their own data in as efficient a way as possible. And when this tool set is created they want to rest assured that their platform can be easily maintained in an environment where resources may be limited. We will attempt to address these sorts of questions.
+
In our experience we find that most biologists want to focus on the science. They may have little knowledge of programming languages or databases, and only passing interest in the IT minutiae. They have deep knowledge of their own data and know how such data can be viewed and analyzed. These biologists want to know how to efficiently create a useful set of tools for their data, and to be assured that their platform and tools can be easily maintained in an environment where resources may be limited. We will attempt to address these sorts of questions.
  
By the way, the word ''we'' used here refers to the [[GMOD_Help_Desk|GMOD Help Desk]]. The Help Desk is a good resource for biologists who want to learn more about GMOD, for whatever reason. Feel free to email us at [mailto:help@gmod.org help@gmod.org].
+
By the way, the word ''we'' used here refers to the [[GMOD_Help_Desk|GMOD Help Desk]]. The Help Desk is a good resource for biologists who want to learn more about GMOD. Feel free to email us at [mailto:help@gmod.org help@gmod.org].
  
 
==What is a GMOD?==
 
==What is a GMOD?==
  
GMOD is a collection of interconnected applications and databases that biologists use as repositories and as tools. That connectivity is really the key here. Bioinformatic applications and databases are produced at a steady rate and this output is described each month in a number of different journals. There's no lack of tools, but many of these tools will be little used since the typical prospective user may not have the resources or expertise required to install the tool and connect it, in some way, to the data in hand. What is generally lacking is a concerted effort to produce tools and databases that will work together.
+
GMOD is a collection of interconnected applications and databases that biologists use as repositories and as tools. That connectivity is really the key here. Bioinformatics applications and databases are produced at a steady rate and this output is described each month in a number of different journals. There's no lack of tools, but many of these tools will be little used as prospective users may not have the resources or expertise required to install the tool and hook it up to their data. What is generally lacking is a concerted effort to produce tools and databases that will work together; GMOD fills this void by providing the means to store data and a comprehensive set of tools for manipulating that data.
  
GMOD also describes a community. Many of the pieces of GMOD, or components, are mature software with many human-years of software development behind them. This amount of effort focussed on design, development, and testing has not occurred simply because someone wanted to code. The demand for software like this has been strong since genome sequence started to appear and many of the first genome databases used GMOD components. So ''GMOD'' describes this diverse group of software developers, scientists, and laboratories that use or improve these software components every day.
+
GMOD also describes a community. Many of the GMOD software components are mature, with many human-years of software development behind them. The design, development, and testing have been driven by a diverse group of software developers, scientists, and laboratories that use or improve these software components every day. The demand for software like this has been strong since genome sequences started to appear, and many of the first genome databases used GMOD components. GMOD database and software components have developed and expanded with the massive growth of genome projects and the changing needs of users.
  
GMOD is also that specific thing that's installed on your computer. It may be the private viewer to your latest data that a student set up over the weekend. It may a terabyte-size database and suite of public Web applications developed over many years at a central laboratory. It may a database of experimental data that's accessible by script, or it may the annotation tool that you use to describe your favorite genome. Now, by describing this variety are we assuring you that whatever you want to do is possible within GMOD? No. The biologists lead and the software developers follow, not the other way around. So you may find that your predicament is not addressed, or is only partially addressed, by what's available in GMOD. You have an option here, which is to do something about this. First, contact the [[GMOD_Help_Desk|GMOD Help Desk]] or one of the main mailing lists like [https://lists.sourceforge.net/lists/subscribe/gmod-schema gmod-schema] to make sure that your understanding of the available GMOD resources is correct. When you're in touch with some knowledgable person try to get a sense of what the solution might be, or its degree of difficulty. It may be that your solution may entail something simple, or it may be that a project may have to be created, complete with partnerships and grants. Be assured that the GMOD participants are very interested in seeing GMOD take off in new directions.
+
GMOD is also that specific thing that is installed on your computer. It may be the private viewer to your latest data that a student set up over the weekend. It may be a terabyte-size database and suite of web applications developed over many years at a central laboratory. It may be a database of experimental data accessible by script, or it may be the annotation tool that you use to describe your favorite genome.
 +
 
 +
GMOD does not claim to cover every potential data storage, analysis or manipulation request that a biologist may have, but GMOD is a project directed by its user base: the biologists lead and the software developers follow, not the other way around. If you find your predicament is not addressed, or is only partially addressed, by what's available in GMOD, your first step should be to contact the [[GMOD_Help_Desk|GMOD Help Desk]] or one of the main mailing lists like [https://lists.sourceforge.net/lists/subscribe/gmod-announce gmod-announce] to make sure that your understanding of the available GMOD resources is correct. When you're in touch with a knowledgeable person, try to get a sense of what the solution might be, or its degree of difficulty. Your solution may entail something simple, or a project may have to be created, complete with partnerships and grants. GMOD participants are always interested in seeing GMOD take off in new directions.
  
  
 
===Is It Just for ''Model'' Organisms?===
 
===Is It Just for ''Model'' Organisms?===
  
At first GMOD stood for '''G'''eneric '''M'''odel '''O'''rganism '''D'''atabase, this was back in the days when there were a handful of ''model organisms'' and it appeared that obtaining the genomic sequence of an organism was a prohibitively expensive proposition, taking months or years to accomplish. Now there are hundreds of such sequences, with thousands easily conceivable. However, few of the scientists studying organisms with sequence consider their organism a ''model'', in this early sense of the word.
 
  
This is a problem for the acronym since ''any'' organism with ''any'' kind of sequence associated with it is a good candidate as a subject for a GMOD database. So, for example, there are GMOD databases with just protein sequence in them like the [http://db.yeastgenome.org/cgi-bin/gbrowse/scproteome/ ''S. cerevisiae'' Proteome Browser]. There are GMOD databases with EST sequence only, such as the [http://racerx00.tamu.edu/bovine/cattle_est_db.html Cattle EST Gene Family Database]. There are GMOD databases that are concerned primarily with gene expression, such as the [http://gmod.mbl.edu/perl/site/emiliania04?page=intro ''Emiliania huxleyi'' Serial Analysis of Gene Expression] database. We even find GMOD databases dedicated to collections of RNA sequence like the [http://gmod.mbl.edu/gb/gbrowse/ltgrna ''Leishmania tarentolae'' RNA Editing] database. We have also heard of GMOD databases for things like oligonucleotides and plasmids. See [[GMOD_Users|GMOD Users]] for a list of other examples. That list of GMOD databases demonstrates that GMOD is widely used, with all sorts of organisms represented, and that these databases can hold sequences of any kind.
+
GMOD stands for '''G'''eneric '''M'''odel '''O'''rganism '''D'''atabase; it was named back in the days when there were a handful of ''model organisms'' and it appeared that obtaining the genomic sequence of an organism was a prohibitively expensive proposition, taking months or years to accomplish. With the ease and speed at which genomes can now be sequenced, few scientists would consider their organism a 'model' in this early sense of the word, so we suggest users think of the ''M'' as standing for ''Myriad'' or ''My''.
 +
 
 +
Any organism with any kind of sequence associated with it is a good candidate as a subject for a GMOD database. There are GMOD databases with just protein sequence in them like the [http://browse.yeastgenome.org/fgb2/gbrowse/scproteome/ ''S. cerevisiae'' Proteome Browser]. There are GMOD databases with EST sequence only, such as the [http://128.206.12.216/cgi-bin/gbrowse/cattle_est_cluster/ Cattle EST Gene Family Database]. There are GMOD databases that are concerned primarily with gene expression, such as the [http://forest.mbl.edu/cgi-bin/site/emiliania04 ''Emiliania huxleyi'' Serial Analysis of Gene Expression] database. We even find GMOD databases dedicated to collections of RNA sequence like the ''Leishmania tarentolae'' RNA Editing database. We have also heard of GMOD databases for oligonucleotides and plasmids. See [[GMOD_Users|GMOD Users]] for a list of other examples. The list of GMOD databases demonstrates that GMOD is widely used, with many organisms represented, and that these databases can hold sequences of any kind.
  
Some clever scientists have proposed that we just drop ''Model'' from the name and "re-brand" ourselves. GMOD thought about it and decided it may cause more problems than it solves. Instead, think of the ''M'' as standing for ''My'' or ''Many'' or ''Myriad''.
 
  
 
==Technologies==
 
==Technologies==
Line 34: Line 36:
 
|-
 
|-
 
|[[Image:mini-arch-diagram.png]]
 
|[[Image:mini-arch-diagram.png]]
|Most GMOD installations have a general architecture in common. There is some source of data and this is going to be called a [[Glossary#Database|database]]. However it does not have to be a relational database, it could be a file or a set of files with or without some kind of index. There's a lot of flexibility at the data level. Choosing this database and loading it will be tasks you'll give a lot of thought to.
+
|Most GMOD installations have a general architecture in common. There is a data source: the [[Glossary#Database|database]]. This does not have to be a relational database; it could be a file or a set of files with or without some kind of index. There's a lot of flexibility at the data level. Choosing this database and loading it will be tasks you'll give a lot of thought to.
  
  
What the user will see is one or more applications. These may be a
+
What the user will see is one or more applications. These may be a set of web pages or a Java application. The choice of applications is dictated by the nature of your data. Sometimes the choice of application is easy or clear for a given kind of data. For some other data types you'll have to take a careful look at a few different applications and consider whether you want to invest more resources in order to create complex data representations or whether you want to expend less effort and offer something simpler.
set of Web pages or a Java application. The choice of applications is
+
dictated by the nature of your data. Sometimes the choice of
+
application is easy or clear for a given kind of data. For some other
+
data types you'll have to take a careful look at a few different
+
applications and consider whether you want to invest more resources in
+
order to create complex data representations or whether you want to
+
expend less effort and offer something simpler.
+
  
  
There will also be software mediating the flow of information between
+
There will also be software mediating the flow of information between the application and the database. Typically this is going to be a [[#Apache: The Web Server|Web server]]: the Web server receives ''requests'' from the application and translates them to database queries, then receives data, formats it, and sends it back to the application. This piece of software can be thought of as performing routine or mechanical tasks. It's important, but you'll install it and typically pay little attention to it.
the application and the database. Typically this is going to be a [[#Apache: The Web Server|Web server]]: the Web server receives ''requests'' from the application and
+
translates them to database queries, then receives data, formats it, and sends
+
it back to the application. This piece of software can be thought
+
as performing routine or mechanical tasks. It's important but you'll
+
install it and typically pay little attention to it.
+
 
|-
 
|-
 
|}
 
|}
Line 63: Line 53:
 
===What is GBrowse?===
 
===What is GBrowse?===
  
[[GBrowse]] is short for ''Genome Browser'', or ''Generic Genome Browser''. [[GBrowse]] is probably GMOD's most popular component and almost all of the databases listed in [[GMOD_Users|GMOD Users]] use GBrowse. It is fairly easy to install, only basic [[wp:Command_line|command-line]] familiarity is required. Do not be misled by the simplicity of the installation though: the reason that GBrowse is popular is because it is a supremely capable browser. The picture below is a partial screenshot of a GBrowse page taken from the [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2 Human chromosome 7 database at TCAG]. A bit of jargon: the rows, each depicting one sort of data, are called ''tracks'' and tracks are populated by one or more images called ''glyphs'', plus text.
+
[[GBrowse]] is short for ''Genome Browser'', or ''Generic Genome Browser''. GBrowse is probably GMOD's most popular component and almost all of the databases listed in [[GMOD_Users|GMOD Users]] use GBrowse. It is fairly easy to install, requiring only basic [[wp:Command_line|command-line]] familiarity. GBrowse is popular because it is a supremely capable browser. The picture below is a partial screenshot of a GBrowse page taken from the [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2 Human chromosome 7 database at TCAG]. A bit of jargon: the rows, each depicting one sort of data, are called ''tracks'' and tracks are populated by one or more images called ''glyphs'', plus text.
  
  
Line 69: Line 59:
  
  
As of this writing GBrowse comes with some 75 different glyphs, including pie charts, dot plots, histograms, and X-Y plots suitable for quantitative data, as well as the expected array of glyphs that describe sequences and sequence annotation. It is also highly configurable, meaning you can do quite a bit of customization of the glyphs, you can link glyphs to URLs of your choice, you can ''internationalize'' the application to display different languages, you can connect and retrieve data from any database, and more. This sort of work generally requires either modifying GBrowse's configuration files or adding your own [[wp:Perl|Perl]] code, the language that [[GBrowse]] is written in. Any customization requiring work in Perl should be considered routine coding, not difficult, and the explanation for this is that GBrowse is ''built to be customized''.
+
GBrowse comes with a large library of glyphs, including pie charts, dot plots, histograms, and X-Y plots suitable for quantitative data, as well as the expected array of glyphs that describe sequences and sequence annotation. It is also highly configurable, meaning you can do quite a bit of customization of the glyphs, you can link glyphs to URLs of your choice, you can ''internationalize'' the application to display different languages, you can connect and retrieve data from any database, and more. This sort of work generally requires either modifying GBrowse's configuration files or adding your own code. GBrowse is written in [[wp:Perl|Perl]]; as GBrowse is designed to be customized, extending its functionality with your own code should not require expert coding skill.
 +
 
 +
 
 +
====JBrowse====
 +
 
 +
[[JBrowse]] is a genome browser with a fully dynamic HTML5 user interface, being developed as the successor to GBrowse. It is very fast and scales well to large datasets. JBrowse is JavaScript-based and does almost all of its work directly in the user's web browser, with minimal requirements for the server. JBrowse's features include:
 +
 
 +
*Fast, smooth scrolling and zooming. Explore your genome with unparalleled speed.
 +
*Scales easily to multi-gigabase genomes and deep-coverage sequencing.
 +
*Supports GFF3, BED, FASTA, Wiggle, BigWig, BAM, VCF (with tabix), REST, and more.  BAM, BigWig, and VCF data are displayed directly from the compressed binary file with no conversion needed.
 +
*Very light server resource requirements. JBrowse has no back-end server code, just tools for formatting data files to be read directly over HTTP. Serve huge datasets from a single low-cost cloud instance.
 +
 
 +
[[File:JBrowse_alignment_and_coverage.png|600px|center|JBrowse alignment and coverage]]
 +
 
 +
Screenshot of JBrowse in action
  
 
===Relational Databases===
 
===Relational Databases===
  
[http://www.databasejournal.com/sqletc/article.php/1469521 Relational databases] are today's tool of choice when faced with the problem of storing complex or multifaceted data, assuming that the data is, or can be, broken down into ever smaller bits of data. All atomized data will end up in one ''field'', analogous to the way that data can be organized as columns in a spreadsheet. Fields describing one sort of thing are organized together into ''tables'' (but database designers do not talk about ''things'', rather ''entities''). For example, a relational database may have a table called ''gene'' with ''gene.name'' and ''gene.geneid'' fields and a ''protein'' table with ''protein.name'', ''protein.proteinid'' and ''protein.sequence'' fields.
+
For those unfamiliar with databases, the [[A_Brief_Guide_to_Databases|brief guide to databases]] provides a gentle introduction.
 +
 
 +
[http://www.databasejournal.com/sqletc/article.php/1469521 Relational databases] are today's tool of choice when faced with the problem of storing complex or multifaceted data, assuming that the data is, or can be, broken down into ever smaller bits of data. Each atomized bit of data ends up in a ''field'', analogous to the way that data can be organized as columns in a spreadsheet. Fields describing aspects of a "thing" (an ''entity'') are organized together into ''tables''. For example, a relational database may have a table called ''gene'' with ''gene.name'' and ''gene.geneid'' fields and a ''protein'' table with ''protein.name'', ''protein.proteinid'' and ''protein.sequence'' fields.
  
  
Line 79: Line 85:
  
  
The picture above shows these two tables, and explains the term ''relational''. The relation between the
+
The picture above shows these two tables, and explains the term ''relational''. The relation between the tables is the shared ''geneid'' - we add the ''geneid'' field to the ''protein'' table to indicate that the CFTR_1 protein record relates back to a specific gene in the ''gene'' table. This ''geneid'' field in ''protein'', which originates in ''gene'' and whose values are stored in ''gene'', is an example of a ''foreign key'', a field from a table that is shared by one or more other tables.
tables is the shared ''geneid'' - we add the ''geneid'' field to the ''protein''
+
table to indicate that the CFTR_1 protein record relates
+
back to a specific gene in the ''gene'' table. This ''geneid'' field in ''protein'', which originates in ''gene'' and whose values are stored in ''gene'', is an example of a special sort of field called a ''foreign key'' - think of it as a shared field or value.
+
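The ''gene'' and ''protein'' tables and their shared ''geneid'' can be sketched in a few lines of Python using the built-in SQLite database. This is only an illustration: the table and field names come from the text above, while the column types, the gene name ''CFTR'', the numeric ids and the placeholder sequence are assumptions made for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked to.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE gene (
    geneid INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE protein (
    proteinid INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    sequence  TEXT,
    geneid    INTEGER NOT NULL REFERENCES gene(geneid)  -- the foreign key
);
""")

# Sample rows: ids, gene name and sequence are made-up placeholders.
conn.execute("INSERT INTO gene (geneid, name) VALUES (1, 'CFTR')")
conn.execute(
    "INSERT INTO protein (proteinid, name, sequence, geneid) VALUES (?, ?, ?, ?)",
    (10, "CFTR_1", "MKT", 1),
)


def gene_for_protein(protein_name):
    """Follow the foreign key from a protein record back to its gene."""
    row = conn.execute(
        "SELECT gene.name FROM protein "
        "JOIN gene ON protein.geneid = gene.geneid "
        "WHERE protein.name = ?",
        (protein_name,),
    ).fetchone()
    return row[0] if row else None


def insert_orphan_protein():
    """Try to store a protein whose geneid matches no gene; return True on success."""
    try:
        conn.execute(
            "INSERT INTO protein (proteinid, name, geneid) VALUES (11, 'orphan', 999)"
        )
    except sqlite3.IntegrityError:
        return False
    return True
```

Calling <code>gene_for_protein("CFTR_1")</code> walks from the protein record to its gene through the shared ''geneid'', and <code>insert_orphan_protein()</code> shows that the database refuses a protein pointing at a gene that does not exist.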
  
For a given collection of data, genomic sequence and annotation for
+
For a given collection of data, such as genomic sequence and annotation, there will be more than one way to represent the data relationally. A given relational design, essentially tables and fields, is called a ''[[Glossary#Database Schema|schema]]'' (think of the schema as an empty blueprint, and of the schema populated with data as the database). Both [[Chado]] and [[BioSQL]] can store genomic data, for example, but they do it differently. The details of schema design are not relevant here, but one can say that the designer may think about some of these general concerns:
example, there will be more than one way to represent the data
+
relationally. A given relational design, essentially tables and
+
fields, is called a ''[[Glossary#Database Schema|schema]]'' (think of the schema as a blueprint,
+
empty, and the schema populated with data as the
+
database). Both [[Chado]] and [[BioSQL]] can store genomic data for example, but they do it differently. The details of how one designs relational schemas is not relevant here but one can say that the designer may think about some of these general concerns:
+
  
 
* The degree of data abstraction, which is related to a database concept called [[wp:Database_normalization|normalization]] and to the flexibility of the schema
 
* The degree of data abstraction, which is related to a database concept called [[wp:Database_normalization|normalization]] and to the flexibility of the schema
* The legibility of the schema, which has to do with the ease of using it
+
* The legibility of the schema, which has to do with the ease of using it
 
* The breadth of the schema, in terms of the data types it could contain
 
* The breadth of the schema, in terms of the data types it could contain
  
From the scientific perspective one can ask related questions: how
+
From the scientific perspective one can ask related questions:
flexible is a given schema? Can it handle my data now and in the
+
 
future? Will using a given schema be easier or harder to use than some
+
*How flexible is a given schema?
other schema?
+
*Can it handle my data now and in the future?
 +
*Will a given schema be easier or harder to use than some other schema?
 +
 
 +
This last question relates mostly to the degree of abstraction of the schema, not to the actual programming languages used.
 +
 
 +
All of today's relational databases are created, loaded, and queried using one language, [[wp:SQL|SQL]]; programmers use their chosen language ([[wp:Perl|Perl]], [[wp:Java|Java]], [[wp:Python|Python]], etc.) to execute SQL queries and process the results.
  
This last question relates mostly to the degree of abstraction of the schema,
 
not to the actual programming languages used. All of today's
 
relational databases are created and loaded and queried using one
 
language, [[wp:SQL|SQL]]. Essentially what the programmers  do is
 
use their chosen language ([[wp:Perl|Perl]], [[wp:Java|Java]],
 
[[wp:Phython|Python]], etc.) to execute SQL and they all do this
 
equally well.
 
  
 
See also:
 
See also:
 +
* [[A Brief Guide to Databases]]
 
* [[Databases and GMOD]]
 
* [[Databases and GMOD]]
 
* [[:Category:Database Tools|Database Tools]]
 
* [[:Category:Database Tools|Database Tools]]
Line 114: Line 111:
 
====Chado and BioSQL====
 
====Chado and BioSQL====
  
So when you choose to use a relational schema it will all really come down to you and your data, not technical
+
So when you choose to use a relational schema it will all really come down to you and your data, not technical details. [[Chado]] is one of the [http://www.databasejournal.com/sqletc/article.php/1469521 relational databases] that are used in GMOD, the other being [http://biosql.org BioSQL].
details. [[Chado]] is one of the [http://www.databasejournal.com/sqletc/article.php/1469521 relational databases] that are used in GMOD, the other being [http://biosql.org BioSQL].
+
 
The differences are clear. [http://biosql.org BioSQL] is quite focussed, it's concerned with:
+
The differences are clear. [http://biosql.org BioSQL] is quite focussed and is concerned with:
  
 
* Sequence
 
* Sequence
Line 123: Line 120:
 
* Publications
 
* Publications
  
It is also a thoroughly modern schema in that it uses
+
It is also a thoroughly modern schema in that it uses [http://obofoundry.org OBO]-style ontologies, such as the [http://geneontology.org Gene Ontology] (GO). This is a requirement, given the ubiquity of ontologies and their ability to describe and organize data.
[http://obofoundry.org OBO]-style ontologies such as GO, the [http://geneontology.org Gene Ontology]. This is a requirement nowadays given the ubiquity of ontologies and their ability to describe and organize our data.
+
  
Chado's focus is broader. Its tables are broken down into groups called
+
Chado's focus is broader. Its tables are broken down into groups called ''modules''; the modules are the following:
''modules'' and the modules are the following:
+
  
 
{{ChadoModules}}
 
{{ChadoModules}}
  
  
It is also possible to ''add'' modules to Chado. For instance in early 2007 a module called [[Chado_Mage_Module|mage]] was added, this one addresses microarray data. Other possibilities that are being discussed are modules for ecological data and additional work for phenotypic data, extending the existing [[Chado_Phenotype_Module|phenotype module]]. The real point is that Chado has been designed to allow extensibility, and one can either formally propose that Chado acquire some new functionality as a module or you can add tables to Chado in the privacy of your own server.
+
It is also possible to ''add'' modules to Chado. For instance, in early 2007 a module called [[Chado_Mage_Module|mage]] was added to address microarray data. Other possibilities that are being discussed are modules for ecological data and additional work for phenotypic data, extending the existing [[Chado_Phenotype_Module|phenotype module]]. The real point is that Chado has been designed to allow extensibility: you can either formally propose that Chado acquire some new functionality as a module, or you can add tables to Chado in the privacy of your own server.
  
Chado is also ontology-aware. One could state this even more forcefully: Chado depends on ontologies. For example in [[Chado_Sequence_Module|Chado's Sequence module]] it's expected that all stored sequences are identified by one or more terms from the [http://sequenceontology.org Sequence Ontology]. A quick scan of the [[Chado_Tables|tables in Chado]], more than 100, shows that about half of the tables contain the field, foreign key, ''cvterm'', referring to an ontology term. The ontology used as source for a term could be one of many but people in the field tend to rely on [http://obofoundry.org OBO] ontologies. So the ontology could be a common and general one like GO, the [http://geneontology.org Gene Ontology], or something highly specific to a group of organisms like the [http://obo.sourceforge.net/cgi-bin/detail.cgi?fly_anatomy Drosophila Anatomy ontology] or the [http://www.informatics.jax.org/searches/MP_form.shtml Mammalian Phenotype ontology]. What these ontologies do in conjunction with Chado is give you a database that is extremely flexible, and as your ontologies expand so does the expressive capability of the system.
+
Chado is also ontology-aware. One could state this even more forcefully: Chado depends on ontologies. For example, in [[Chado_Sequence_Module|Chado's Sequence module]] it is expected that all stored sequences are identified by one or more terms from the [http://sequenceontology.org Sequence Ontology]. A quick scan of the [[Chado_Tables|tables in Chado]], more than 100, shows that about half of the tables contain a foreign key to ''cvterm'', that is, a field referring to an ontology term. The ontology used as source for a term could be one of many, but people in the field tend to rely on [http://obofoundry.org OBO] ontologies. So the ontology could be a common and general one like GO, the [http://geneontology.org Gene Ontology], or something highly specific to a group of organisms like the [http://obo.sourceforge.net/cgi-bin/detail.cgi?fly_anatomy Drosophila Anatomy ontology] or the [http://www.informatics.jax.org/searches/MP_form.shtml Mammalian Phenotype ontology]. In conjunction with Chado, these ontologies give you a database that is extremely flexible, and as your ontologies expand, so does the expressive capability of the system.
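To make the idea of an ontology-typed record concrete, here is a heavily reduced sketch in Python and SQLite. It is modelled loosely on the way a Chado ''feature'' row points at a ''cvterm'' row; the real Chado tables carry many more columns and constraints, and the feature names below are invented for the example. The terms ''gene'' and ''mRNA'' are Sequence Ontology terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced stand-in for an ontology term table (not the real Chado definition).
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv        TEXT NOT NULL,   -- simplification: ontology name stored inline
    name      TEXT NOT NULL
);
-- Reduced stand-in for a table of sequence features.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)  -- what kind of thing is this?
);
""")

conn.executemany(
    "INSERT INTO cvterm (cvterm_id, cv, name) VALUES (?, ?, ?)",
    [(1, "sequence", "gene"), (2, "sequence", "mRNA")],
)
# Hypothetical feature names, for illustration only.
conn.executemany(
    "INSERT INTO feature (feature_id, uniquename, type_id) VALUES (?, ?, ?)",
    [(1, "geneA", 1), (2, "geneA-RA", 2), (3, "geneB", 1)],
)


def features_of_type(term_name):
    """Return the names of all features typed with the given ontology term."""
    rows = conn.execute(
        "SELECT feature.uniquename FROM feature "
        "JOIN cvterm ON feature.type_id = cvterm.cvterm_id "
        "WHERE cvterm.name = ? ORDER BY feature.uniquename",
        (term_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The design point is that the kind of a record is data, not structure: storing a new kind of feature means adding a row to the term table, not adding a new table or column.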
  
Now there is a cost to this flexibility and breadth: Chado is complex and one must devote a certain amount of study to it, it's unlikely that someone unfamiliar with Chado can install it and then immediately set about loading it with biological data of different sorts. Fortunately there are [[GMOD_Mailing_Lists|mailing lists]] you can contact, as well as the [[GMOD_Help_Desk|GMOD Help Desk]], and number of pages on this Wiki discussing Chado (see [[Chado_-_Getting_Started|Getting Started with Chado]] and the [[Chado_Manual|Chado Manual]]).
+
There is a cost to this flexibility and breadth: Chado is complex and it is unlikely that someone unfamiliar with Chado would be able to install it and then immediately set about loading it with biological data of different sorts. Fortunately, there are [[GMOD_Mailing_Lists|mailing lists]] you can contact, as well as the [[GMOD_Help_Desk|GMOD Help Desk]], and a number of pages on this Wiki discussing Chado (see [[Chado_-_Getting_Started|Getting Started with Chado]] and the [[Chado_Manual|Chado Manual]]).
  
 
====GFF Databases====
 
====GFF Databases====
  
In addition to relational database schemas like [[Chado]] and [[BioSQL]] you will also encounter what are called ''GFF databases'' in the GMOD world. [[GFF]] is a compact format for describing sequence and sequence annotations. GMOD installations like the [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2 Human Chromosome 1 database] described above are concerned solely with sequence and annotation and the entire contents of such a database can be represented as GFF. For small installations the entire database can be just a set of GFF text files (in fact, you can install [[GBrowse]] on your personal computer and then browse ''Saccharomyces'' and ''Volvox'' genomic sequence, reading directly from GFF files installed along with GBrowse - ''try it!''). But when the amount of GFF gets too large to be read into memory all at once you have to store the GFF in some form that's indexed for fast retrieval. The solution is to load the GFF into [[MySQL]] or some other sort of database management system, this assures good performance even if you have very large amounts of data in GFF format. This is accomplished by using the Bio::DB::GFF or Bio::DB::SeqFeature [[GBrowse Adaptors]].
+
In addition to relational database schemas like [[Chado]] and [[BioSQL]] you will also encounter ''GFF databases''. [[GFF]] is a compact format for describing sequence and sequence annotations. GMOD installations like the [http://www.chr7.org/cgi-bin/gbrowse/gbrowse/chr7_v2 Human Chromosome 7 database] described above are concerned solely with sequence and annotation, and the entire contents of such a database can be represented as GFF. For small installations the entire database can be just a set of GFF text files (in fact, you can install [[GBrowse]] on your personal computer and then browse ''Saccharomyces'' and ''Volvox'' genomic sequence, reading directly from GFF files installed along with GBrowse - ''try it!''). But when the amount of GFF gets too large to be read into memory all at once, you have to store the GFF in some form that is indexed for fast retrieval. The solution is to load the GFF into [[MySQL]] or some other database management system; this ensures good performance even if you have very large amounts of data in GFF format. This is accomplished by using the Bio::DB::GFF or Bio::DB::SeqFeature [[GBrowse Adaptors]].
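
As an illustration of how compact GFF is, the following Python sketch parses a single GFF line into its nine tab-separated columns. It assumes GFF3-style attributes (''key=value'' pairs separated by semicolons); the example line, the `GffFeature` class and the `parse_gff3_line` function are invented for this page and are not part of any GMOD tool.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    """One GFF record: the nine standard tab-separated columns."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one GFF3-style line into its columns and attribute pairs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = {}
    for pair in cols[8].split(";"):
        if pair:  # skip empty pieces, e.g. after a trailing semicolon
            key, _, value = pair.partition("=")
            attrs[key] = value
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attrs)


# A made-up annotation: a gene on the forward strand of chromosome I.
example = "chrI\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=YFG1"
feature = parse_gff3_line(example)
```

Reading every line this way is fine for small files; for large data sets the same columns are what a GFF database stores and indexes.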
  
===What is Apollo?===
+
===What are WebApollo and Apollo?===
  
Unlike [[GBrowse]], a browser, [[Apollo]] is for both viewing and
+
Unlike [[JBrowse]] and [[GBrowse]], which function only as sequence browsers, [[WebApollo]] and its standalone predecessor [[Apollo]] are for both viewing and manually annotating genomes. WebApollo is a plugin for [[JBrowse]] that allows multiple users to annotate genomes concurrently. Changes made by others are automatically and immediately updated in the user's browser window, ensuring that there is no duplication of effort, and allowing several users to annotate parts of the same sequence at the same time. A full history of all edits is kept, and the changes made in editing sessions can be approved or rejected by an administrator before being saved. WebApollo shares JBrowse's fast, flexible browsing interface, and users require only a web browser to use it.
manually annotating genomes. It also differs from GBrowse in that
+
 
it's a [[:Category:Java|Java application]] so there features built in to it that make annotating a bit more efficient than through a Web page. It can
+
[[Apollo]] is a standalone [[:Category:Java|Java application]] for manual sequence annotation, and is the predecessor of WebApollo. Apollo can read and write to Chado databases, but lacks the instant updates that WebApollo features. We recommend using [[WebApollo]] as it is under active development and has a fuller feature set than Apollo.
connect to some of the same databases as GBrowse, like [[Chado]], so one can imagine using Apollo as a tool for expert curation and GBrowse as a viewer on
+
the same data set, for example.
+
  
 
=== What are MAKER and DIYA? ===
 
=== What are MAKER and DIYA? ===
  
[[GBrowse]] and [[Apollo]] both deal with [[:Category:Annotation|genome annotations]], but where do these annotations come from? Frequently they come from a ''genome annotation pipeline'', a software package or series of software packages that takes an assembly (and other things) as input and produces an annotated genome, often with gene models, ESTs, proteins, and almost anything else that can be tied back to a genomic sequence.
+
[[GBrowse]] and [[Apollo]] both deal with [[:Category:Annotation|genome annotations]], but where do these annotations come from? Frequently they come from a ''genome annotation pipeline'', a software package or series of software packages that takes an assembly (and other things) as input and produces an annotated genome, often with gene models, ESTs, proteins, and almost anything else that can be tied back to a genomic sequence.
  
[[MAKER]] is a genome annotation pipeline that produces annotated eukaryotic genomes, and [[DIYA]] is a genome annotation pipeline for prokaryotic genomes (and both do more than that too). They both produce gene models in [[GFF]], a file format that can be directly loaded into [[GBrowse]], [[Apollo]], and [[Chado]].
+
[[MAKER]] is a genome annotation pipeline that produces annotated eukaryotic genomes, and [[DIYA]] is a genome annotation pipeline for prokaryotic genomes (and both do more than that too). They both produce gene models in [[GFF]], a file format that can be directly loaded into [[GBrowse]], [[Apollo]], and [[Chado]].
  
 
=== What is Pathway Tools? ===
 
=== What is Pathway Tools? ===
  
[[Pathway Tools]] is a software system for creating organism-specific databases.
+
[[Pathway Tools]] is a software system for creating organism-specific databases. It contains extensive functionality that spans from genomes to pathways including a genome browser, metabolic pathway predictor and viewer, and regulatory network viewer, as well as a large number of interactive annotation tools.
It contains extensive functionality that spans from genomes to pathways including
+
a genome browser, metabolic pathway predictor and viewer, and regulatory network
+
viewer, as well as a large number of interactive annotation tools.
+
  
 
===What is CMap?===
 
===What is CMap?===
  
[[CMap]] is a popular [[:Category:Comparative Genomics|comparative]] map viewer. It was initially designed for use at [http://gramene.org Gramene] but was re-designed to be used for any organism or set of organisms. It can display genetic maps or physical maps and draw the relations between the two. It will also show synteny. It is written in [[wp:Perl|Perl]] and requires an underlying RDBMS such as [http://mysql.org Mysql]. If you need to display maps or syntenic relationships then you may need more than [[GBrowse]].
+
[[CMap]] is a popular [[:Category:Comparative Genomics|comparative]] map viewer. It was initially created for use at [http://gramene.org Gramene] but was redesigned to be used for any organism or set of organisms. It can display genetic maps or physical maps and draw the relations between the two. It will also show synteny. It is written in [[wp:Perl|Perl]] and requires an underlying RDBMS such as [http://www.mysql.com MySQL]. If you need to display maps or syntenic relationships, you may need more than [[GBrowse]].
 +
 
  
 
====And SynView? or Sybil? or GBrowse_Syn?====
 
====And SynView? or Sybil? or GBrowse_Syn?====
  
Yes, there are other [[:Category:Comparative Genomics|comparative genomics]] viewers. The alternatives to [[CMap]] are [[GBrowse_syn]], [[Sybil]], and [[SynView]]. [[Sybil]] stores its data in [[Chado]] and accommodates quite a variety of different analyses, you should go to the [http://sybil.sourceforge.net/ Sybil Web site] if you want to learn more. [[GBrowse_syn]] and [[SynView]] build upon [[GBrowse]], and they can be considered a bit simpler than [[Sybil]] and [[CMap]]. You should take a good look at the respective Web sites and determine which is most suitable for you.
+
Yes, there are other [[:Category:Comparative Genomics|comparative genomics]] viewers. The alternatives to [[CMap]] are [[GBrowse_syn]], [[Sybil]], and [[SynView]]. [[Sybil]] stores its data in [[Chado]] and accommodates a variety of different analyses; go to the [http://sybil.sourceforge.net/ Sybil Web site] if you want to learn more. [[GBrowse_syn]] and [[SynView]] build upon [[GBrowse]], and they can be considered a bit simpler than [[Sybil]] and [[CMap]]. More information is available on their websites to help you determine which is most suitable for you.
 +
 
  
 
=== What is Tripal? ===
 
=== What is Tripal? ===
  
[[Tripal]] is a tool that creates a web sites on top of a [[Chado]] database. A problem for many researchers is a lack of IT or bioinformatics resources and [[Tripal]] is part of a solution. Instead of manually creating Web pages, one by one, and writing code to connect these pages to a database these tools perform these steps automatically. There are also ways to customize the resulting web site so the pages make sense to you. You ''will'' need some IT expertise to do this, this is an expert's tool, but the job will take much less time and effort and can be undertaken by a student, for example. You can also use this tool iteratively, building and re-building your site until it looks right. You will also need the [[Chado]] database.  [http://marinegenomics.org MarineGenomics.org] is an example site created with [[Tripal]].
+
[[Tripal]] is a web frontend for a Chado database that provides both an attractive, slick website for accessing and disseminating Chado data, and an interface for local users to upload and edit data in the database. Tripal is based on the popular content management system [http://drupal.org Drupal], and creates a customisable website from the Chado database. Tripal includes a number of analysis modules that allow the incorporation of external data (for example, Gene Ontology annotations), and tools such as [[GBrowse]], [[Galaxy]], and [[CMap]] can be integrated into the site. Tripal is very customisable and, as it is based on Drupal, extra website content can easily be added using the standard Drupal functionality. For groups looking to avoid the substantial investment of time and effort involved in creating a website for data display and dissemination, Tripal offers a simple, stylish solution.
  
 
===What is Modware?===
 
===What is Modware?===
  
[[Modware]] is a middleware package used in GMOD, written in [[wp:Perl|Perl]]. Middleware is software that mediates the exchange of information between applications, e.g. between Web pages and databases. If you want a serious discussion of the technical details please see the [[GMOD_Middleware|GMOD Middleware]] page. The purpose of introducing [[Modware]] here is to say the the GMOD developers have evaluated a number of Perl middleware packages and decided that Modware is the one that developers should use if they prefer to write in Perl. Like [http://bioperl.org Bioperl], Modware may be a package that you may need to install but won't need to understand in any detail.
+
[[Modware]] is a middleware package used in GMOD, written in [[wp:Perl|Perl]]. Middleware is software that mediates the exchange of information between applications, e.g. between web pages and databases; please see the [[GMOD_Middleware|GMOD Middleware]] page for technical details. GMOD developers have evaluated a number of Perl middleware packages and decided that Modware is best suited to GMOD Perl development. Like [http://bioperl.org Bioperl], Modware is a package that you may need to install but won't need to understand in any detail.
 +
 
  
 
===What is BioPerl?===
 
===What is BioPerl?===
Line 183: Line 176:
 
[http://bioperl.org BioPerl] is a popular bioinformatics toolkit written in [[wp:Perl|Perl]]. The reason we mention it here is because many of the [[GMOD_Components|GMOD Components]] use parts of it. You will '''not''' have to learn BioPerl in order to use GMOD but you may have to install it.
 
[http://bioperl.org BioPerl] is a popular bioinformatics toolkit written in [[wp:Perl|Perl]]. The reason we mention it here is because many of the [[GMOD_Components|GMOD Components]] use parts of it. You will '''not''' have to learn BioPerl in order to use GMOD but you may have to install it.
  
On the other hand [[BioPerl]] does offer some attractive ways to store genomic data, not requiring any sort of [[Glossary#Relational Database Management System|relational]] database. We discussed [[Chado]] and [[BioSQL]] above. These two relational [[Glossary#Schema|schemas]] require the prior installation of some free, open source [[Glossary#RDBMS|RDBMS]] like [[MySQL]] or [[PostgreSQL]]. Now installing these pieces, schema plus RDBMS, is not necessarily difficult but if all you have is sequence and sequence annotation it turns out that you can set up a sequence or genome browser using just BioPerl and [[GBrowse]] (and [http://apache.org Apache], your Web server). To be precise, you can use either the {{CPAN|Bio::DB::GFF}} module from BioPerl or the {{CPAN|Bio::DB::SeqFeature}} module. See [[#A_Simple_Sequence_Browser|A Simple Sequence Browser]] below.
+
[[BioPerl]] offers some attractive ways to store genomic data without requiring a [[Glossary#Relational Database Management System|relational]] database. We discussed [[Chado]] and [[BioSQL]] above; these two relational [[Glossary#Schema|schemas]] require the prior installation of some free, open source [[Glossary#RDBMS|RDBMS]] like [[MySQL]] or [[PostgreSQL]]. Installing these pieces, schema plus RDBMS, is not necessarily difficult, but if you only have sequence and sequence annotation, you can set up a sequence or genome browser using BioPerl, [[GBrowse]], and an [http://apache.org Apache] web server. You can use either the {{CPAN|Bio::DB::GFF}} module from BioPerl or the {{CPAN|Bio::DB::SeqFeature}} module. See [[#A_Simple_Sequence_Browser|A Simple Sequence Browser]] below.
  
 
===And What Else is in GMOD?===
 
===And What Else is in GMOD?===
  
A number of other software packages, listed below, classified by general function. One might be tempted to think of this as a shopping list, choosing one of each. But it may also be useful to think of what is absolutely essential first and consider these other components as ''add-ons''. We also have to add that some of these components are only ''loosely coupled'' to some of the more core components described above. In other words, an application might use its own methods to store data and not use [[Chado]]. Or, a component may be written in [[wp:Java|Java]] and not Perl, so it would not be able to communicate with a Perl application. For something to be considered a GMOD component it does not, at this time, have to connect to some other component.
+
A number of other software packages are listed below, classified by general function. One might be tempted to think of this as a shopping list, choosing one of each. But it may also be useful to think of what is absolutely essential first and consider these other components as ''add-ons''. Some of these components are only ''loosely coupled'' to the core components described above; an application might use its own methods to store data and not use [[Chado]]. Or, a component may be written in [[wp:Java|Java]] and not Perl, so it would not be able to communicate with a Perl application. For something to be considered a GMOD component it does not, at this time, have to connect to some other component.
  
 
{{GMODComponents}}
 
{{GMODComponents}}
Line 199: Line 192:
 
===A Simple Sequence Browser===
 
===A Simple Sequence Browser===
  
* The data: sequence (genomic DNA or ESTs or proteins or cDNAs or some combination of these or…)
+
* The data: sequence (genomic DNA or ESTs or proteins or cDNAs or some combination of these or ...)
 
* The goal: create a browser to query and view sequence and sequence annotations
 
* The goal: create a browser to query and view sequence and sequence annotations
 
* The core software: [[GBrowse]], Apache Web server, and [http://bioperl.org Bioperl]
 
* The core software: [[GBrowse]], Apache Web server, and [http://bioperl.org Bioperl]
 
* The hardware: a server running Unix (Linux or Mac) or Windows
 
* The hardware: a server running Unix (Linux or Mac) or Windows
  
# Figure out what the annotations should be (e.g. gene coordinates, motif matches, or oligonucleotide matches). You can try using annotation pipelines like [[MAKER]] to build these automatically.
+
# Figure out what the annotations should be (e.g. gene coordinates, motif matches, or oligonucleotide matches). You can try using annotation pipelines like [[MAKER]] to build these automatically.
 
# Install core software
 
# Install core software
# Create or gather the annotations (BLAST results or HMMER results or GenBank files or…)
+
# Create or gather the annotations (BLAST results or HMMER results or GenBank files or ...)
 
# Transform all the annotations into a format suitable for loading ([[GFF]] format)
 
# Transform all the annotations into a format suitable for loading ([[GFF]] format)
 
# Load GFF into the GFF database
 
# Load GFF into the GFF database
Line 219: Line 212:
  
 
Highly recommended. Setting this up will give you a good sense of how the software pieces interoperate. Not only that, but [[GBrowse]] is fun and it comes with sample databases so once it's installed you have actual genome sequence to play with. You can even get GBrowse running nicely on a laptop.
 
Highly recommended. Setting this up will give you a good sense of how the software pieces interoperate. Not only that, but [[GBrowse]] is fun and it comes with sample databases so once it's installed you have actual genome sequence to play with. You can even get GBrowse running nicely on a laptop.
 +
  
 
===A Simple Sequence Browser plus a Sequence Annotator===
 
===A Simple Sequence Browser plus a Sequence Annotator===
  
* The data: sequence (genomic DNA or ESTs or cDNAs or some combination of these or…)
+
* The data: sequence (genomic DNA or ESTs or cDNAs or some combination of these or ...)
 
* The goal: create a browser to query and view sequence and sequence annotations along with an editor to manually annotate the sequences
 
* The goal: create a browser to query and view sequence and sequence annotations along with an editor to manually annotate the sequences
 
* The core software: [[GBrowse]], [[Apollo]], [[Chado]] (plus [[Glossary#Relational Database Management System|relational database]]), Apache Web server, and [[BioPerl]]
 
* The core software: [[GBrowse]], [[Apollo]], [[Chado]] (plus [[Glossary#Relational Database Management System|relational database]]), Apache Web server, and [[BioPerl]]
 
* The [[Computing Requirements|hardware]]: a server running [[Glossary#Unix|Unix]] (Linux or Mac) or Windows
 
* The [[Computing Requirements|hardware]]: a server running [[Glossary#Unix|Unix]] (Linux or Mac) or Windows
  
# Figure out what the annotations should be (gene coordinates or motif matches or oligonucleotide matches or hand-made annotations or some combination of these or…)
+
# Figure out what the annotations should be (gene coordinates or motif matches or oligonucleotide matches or hand-made annotations or some combination of these or ...)
 
# Install core software
 
# Install core software
# Create or gather the annotations (BLAST results or HMMER results or GenBank files or…)
+
# Create or gather the annotations (BLAST results or HMMER results or GenBank files or ...)
 
# Transform all the annotations into a format suitable for loading ([[GFF]] format)
 
# Transform all the annotations into a format suitable for loading ([[GFF]] format)
 
# Load GFF into the Chado database
 
# Load GFF into the Chado database
Line 236: Line 230:
  
 
A challenge: Step 2, installing core software (with more components you have a more complex system and more potential pitfalls, and Chado and its relational database is a fairly detailed install)
 
A challenge: Step 2, installing core software (with more components you have a more complex system and more potential pitfalls, and Chado and its relational database is a fairly detailed install)
 +
 
Possible challenge: Step 4, converting all the annotations to [[GFF]] (scripts may be available to perform all the conversions, or you may have to write some of the conversion code yourselves)
 
Possible challenge: Step 4, converting all the annotations to [[GFF]] (scripts may be available to perform all the conversions, or you may have to write some of the conversion code yourselves)
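
To give a sense of the scale of such conversion code, here is a minimal sketch that turns one similarity hit into a GFF line. It is written in Python purely for illustration (GMOD scripts are more commonly written in Perl), and the function name, its parameters, the default ''source'' and ''type'' values, and the sample hit are all assumptions rather than the output format of any particular tool.

```python
def hit_to_gff(seqid, start, end, hit_name, score,
               source="blast", feature_type="match"):
    """Format one similarity hit as a nine-column, tab-separated GFF line.

    If the hit is reported with start > end, it is treated as lying on the
    reverse strand and the coordinates are swapped so that start <= end.
    """
    strand = "+"
    if start > end:
        start, end = end, start
        strand = "-"
    attributes = f"Name={hit_name}"
    columns = [seqid, source, feature_type, str(start), str(end),
               str(score), strand, ".", attributes]
    return "\t".join(columns)


# A hypothetical reverse-strand hit on chromosome I.
line = hit_to_gff("chrI", 5200, 4800, "P12345", 87.5)
```

In practice most of the effort goes into reading each tool's own output format; writing the GFF line itself is the easy part.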
  
 
Skills needed: basic command-line competence, perhaps basic Perl competence if you have to write any custom conversion code. Some understanding of relational databases for the Chado installation. Basic Java competence for the Apollo installation.
 
Skills needed: basic command-line competence, perhaps basic Perl competence if you have to write any custom conversion code. Some understanding of relational databases for the Chado installation. Basic Java competence for the Apollo installation.
 +
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
  
 
====Recommendation====
 
====Recommendation====
  
If you’re a GMOD novice then install GBrowse by
+
If you're a GMOD novice, install GBrowse by itself first ([[#A Simple Sequence Browser|A Simple Sequence Browser]]), then consider this system.
itself first ([[#A Simple Sequence Browser|A Simple Sequence Browser]]), then consider this system.
+
 
  
 
===A Browser for a Stock Collection===
 
===A Browser for a Stock Collection===
  
* The data: the stock collection data in some structured form (Excel or Word or…)
+
* The data: the stock collection data in some structured form (Excel or Word or ...)
* The goal: create a browser to query and view your laboratory’s stock collection
+
* The goal: create a browser to query and view your laboratory's stock collection
 
* The core software: [[Chado]] (and its relational database), Apache Web server, and [[Turnkey]]
 
* The core software: [[Chado]] (and its relational database), Apache Web server, and [[Turnkey]]
 
* The hardware: a server running the Unix (Linux or Mac) or Windows [[Glossary#Operating System|operating system]].
 
* The hardware: a server running the Unix (Linux or Mac) or Windows [[Glossary#Operating System|operating system]].
Line 263: Line 259:
 
* Possible challenge: Step 3, running Turnkey to automatically create your browser. Turnkey is a new tool. It has been used successfully in testing and at [[ParameciumDB]] but not all possibilities have been tested.
 
* Possible challenge: Step 3, running Turnkey to automatically create your browser. Turnkey is a new tool. It has been used successfully in testing and at [[ParameciumDB]] but not all possibilities have been tested.
  
Skills needed: General IT expertise (Turnkey automates the creation of Web sites but it is an expert’s tool) Basic programming competence to write the custom conversion code.
+
Skills needed: General IT expertise (Turnkey automates the creation of Web sites but it is an expert's tool). Basic programming competence to write the custom conversion code.
 +
 
 
Resources available: documentation at [[Main Page|www.gmod.org]], the [[GMOD Help Desk]], the [[GMOD Mailing Lists]].
 
Resources available: documentation at [[Main Page|www.gmod.org]], the [[GMOD Help Desk]], the [[GMOD Mailing Lists]].
  
 
====Recommendation====
 
====Recommendation====
  
Consider whether you want to explore uncharted
+
Consider whether you want to explore uncharted territory or not. Could be fairly straightforward for the expert, or could be challenging.
territory or not. Could be fairly straightforward for the expert, or could
+
be challenging.
+
  
 
===A Browser for Microarray Data===
 
===A Browser for Microarray Data===
  
 
* The data: microarray data in Affymetrix format
 
* The data: microarray data in Affymetrix format
* The goal: create a browser to query and view your laboratory’s microarray
+
* The goal: create a browser to query and view your laboratory's microarray data
* The core software: Chado, Apache Web server, and .
+
* The core software: Chado, Apache Web server, and ...
 
* The hardware: a server running Unix (Linux or Mac) or Windows
 
* The hardware: a server running Unix (Linux or Mac) or Windows
  
Challenge: Chado can hold the microarray data using its [[Chado_Mage_Module|Mage module]] and applications exist to view raw microarray data (e.g. [[Caryoscope]], [[GeneXplorer]]) but these applications don’t connect to Chado.
+
Challenge: Chado can hold the microarray data using its [[Chado_Mage_Module|Mage module]] and applications exist to view raw microarray data (e.g. [[Caryoscope]], [[GeneXplorer]]) but these applications don't connect to Chado.
  
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
Line 285: Line 280:
 
====Recommendation====
 
====Recommendation====
  
Either wait for the connectors to be built to some application or form a
+
Either wait for the connectors to be built to some application or form a partnership with GMOD scientists and developers to see that the connectors are built.
partnership with GMOD scientists and developers to see that the
+
connectors are built.
+
  
 
===A Browser for Map Data===
 
===A Browser for Map Data===
Line 301: Line 294:
  
 
Possible challenge: Step 2, the installation. This may be tricky if you choose one of the more fully featured packages ([[CMap]] or [[Sybil]]).
 
Possible challenge: Step 2, the installation. This may be tricky if you choose one of the more fully featured packages ([[CMap]] or [[Sybil]]).
 +
 
Possible challenge: Step 3, the loading. It is likely that some custom coding would be required since map data comes in all sorts of different forms.
 
Possible challenge: Step 3, the loading. It is likely that some custom coding would be required since map data comes in all sorts of different forms.
  
 
Skills needed: Basic command-line competence. Some understanding of relational databases for [[CMap]] or [[Sybil]]. Basic programming competence to write the custom loading code.
 
Skills needed: Basic command-line competence. Some understanding of relational databases for [[CMap]] or [[Sybil]]. Basic programming competence to write the custom loading code.
 +
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
 
Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists
  
 
====Recommendation====
 
====Recommendation====
  
Choose one. GMOD offers good choices here, it comes down to your
+
Choose one. GMOD offers good choices here; it comes down to your data and your resources. [[SynView]] is the easiest, and it comes with [[GBrowse]].
data and your resources. [[SynView]] is the easiest, and it comes with [[GBrowse]].
+
  
  
Line 323: Line 317:
 
===Software===
 
===Software===
  
GMOD is software that relies on other software in order to function. This section lists some other key open source packages that you may need.
+
GMOD software relies on other software to function. This section lists some other key open source packages that you may need.
  
  
 
==== Databases ====
 
==== Databases ====
  
The [[#Relational Databases|Relational Databases]] section above introduced many relational database concepts. [[Databases and GMOD]] discusses [[Glossary#Database Management System|database management system]] choices in GMOD. It also introduces some additional terminology.
+
The [[#Relational Databases|Relational Databases]] section above introduced many relational database concepts. [[Databases and GMOD]] discusses [[Glossary#Database Management System|database management system]] choices in GMOD. It also introduces some additional terminology.
 +
 
  
 
==== Programming Languages ====
 
==== Programming Languages ====
  
Two programming languages are popular in GMOD: Perl and Java. For most tasks you won't need to do any programming in either language. You will just need to know how to install these languages and how to install programs written in these languages. See [[Computing Requirements]] for more.
+
Two programming languages are popular in GMOD: Perl and Java. For most tasks you won't need to do any programming in either language. You will just need to know how to install these languages and how to install programs written in these languages. See [[Computing Requirements]] for more.
 
+
  
 
=====Perl=====
 
=====Perl=====
  
The programming language most used in the bioinformatics realm. Also
+
Perl is the programming language most used in bioinformatics, and also the language most used by GMOD developers. It is well-suited to text and data processing and is also characterized by an extensive open source library, so it's highly functional. Many GMOD components use [http://bioperl.org BioPerl], a bioinformatics toolkit written in Perl.
the language most used by GMOD developers. It is well-suited to text
+
and data processing and is also characterized by an extensive open
+
source library, so it's highly functional. Many of GMOD components use
+
[http://bioperl.org BioPerl], a bioinformatics toolkit written in
+
Perl.
+
  
Some pieces of GMOD, like [[GBrowse]], ''can'' be extended or customized
+
Some pieces of GMOD, like [[GBrowse]], ''can'' be extended or customized using Perl but beginners' skills in Perl would be sufficient for this work. Just installing and using [[GBrowse]] in a conventional way does not require knowledge of Perl or [http://bioperl.org BioPerl].
using Perl but beginners' skills in Perl would be sufficient for this
+
work. Just installing and using [[GBrowse]] in a conventional way does not require knowledge of Perl or [http://bioperl.org BioPerl].
+
  
 
=====Java=====
 
=====Java=====
  
Java is arguably the world's most popular programming language but it
+
Java is arguably the world's most popular programming language but it is not as popular for command-line work on Unix as Perl. It's encountered in GMOD primarily as a language to construct user interfaces (e.g. [[Apollo]]).
is not as popular for command-line work on Unix as Perl. It's
+
encountered in GMOD primarily as a language to construct user
+
interfaces (e.g. [[Apollo]]).
+
 
+
  
 
====Apache, the Web Server====
 
====Apache, the Web Server====
  
Anytime you want to set up a system that displays Web pages you will need a Web server. If someone else hasn't already installed this for you then you will want to use the [http://apache.org Apache Web server] (also known as the ''Apache HTTP Server''). Free of course, and secure and fast. It also turns out to be reasonably simple to install, on Unix or Windows.
+
If you want to set up an application that displays web pages, you will need a web server on your computer. If you don't already have one installed, you will want to use the [http://apache.org Apache Web server] (also known as the ''Apache HTTP Server''), which is free, fast, secure, and reasonably simple to install on Unix or Windows.
  
  
Line 368: Line 351:
 
===Licenses===
 
===Licenses===
  
Most [[GMOD Components]] have no restrictions on using them. However, a few do impose some restrictions.  These few components will highlight that they have restricted licenses.
+
Most [[GMOD Components]] have no restrictions on using them. Those few components that do impose restrictions will clearly state that they have restricted licenses.
 
+
 
+
  
 
[[Category:Needs Editing]]
 
[[Category:Needs Editing]]
 
[[Category:Help]]
 
[[Category:Help]]
 
[[Category:Biology]]
 
[[Category:Biology]]

Latest revision as of 19:56, 7 October 2013

... formerly titled "GMOD for the Biologist".

This page provides an overview of the GMOD project. It does not assume any particular background in computing.

Introduction

With the amount of technical documentation available for GMOD, the casual observer would be forgiven if they concluded that GMOD was a project about software. But it's not: GMOD has been created for biologists and in the real world, it is used by biologists. However, the creators of GMOD are mostly not practicing biologists and the look and the feel of most GMOD documentation reflects this. What we will attempt to do is discuss GMOD from the researchers' perspective. This does not simply mean describe what the software does. If you look, for example, at a typical GBrowse page (e.g. GBrowse view of human chromosome 7), you'll understand immediately what GBrowse is built to do, and a few more minutes of clicking and scrolling will reveal all sorts of useful ways to display and query the data. A modern biologist knows a great deal about bioinformatics functionality already. This introduction is more concerned with answering practical questions, such as "given the data I have, what database should I use?", "do I even need a database?", or "how hard is this going to be?".

In our experience we find that most biologists want to focus on the science. They may have little knowledge of programming languages or databases, and only passing interest in the IT minutiae. They have deep knowledge of their own data and know how such data can be viewed and analyzed. These biologists want to know how to efficiently create a useful set of tools for their data, and to be assured that their platform and tools can be easily maintained in an environment where resources may be limited. We will attempt to address these sorts of questions.

By the way, the word "we" used here refers to the GMOD Help Desk. The Help Desk is a good resource for biologists who want to learn more about GMOD. Feel free to email us at help@gmod.org.

What is a GMOD?

GMOD is a collection of interconnected applications and databases that biologists use as repositories and as tools. That connectivity is really the key here. Bioinformatics applications and databases are produced at a steady rate and this output is described each month in a number of different journals. There's no lack of tools, but many of these tools will be little used as prospective users may not have the resources or expertise required to install the tool and hook it up to their data. What is generally lacking is a concerted effort to produce tools and databases that will work together; GMOD fills this void by providing the means to store data and a comprehensive set of tools for manipulating that data.

GMOD also describes a community. Many of the GMOD software components are mature software with many human-years of software development behind them. The design, development, and testing has been driven by a diverse group of software developers, scientists, and laboratories that use or improve these software components every day. The demand for software like this has been strong since genome sequence started to appear and many of the first genome databases used GMOD components. GMOD database and software components have developed and expanded with the massive growth and development of genome projects and the changing needs of users.

GMOD is also that specific thing that is installed on your computer. It may be the private viewer to your latest data that a student set up over the weekend. It may be a terabyte-size database and suite of web applications developed over many years at a central laboratory. It may be a database of experimental data accessible by script, or it may be the annotation tool that you use to describe your favorite genome.

GMOD does not claim to cover every potential data storage, analysis or manipulation request that a biologist may have, but GMOD is a project directed by its user base: the biologists lead and the software developers follow, not the other way around. If you find your predicament is not addressed, or is only partially addressed, by what's available in GMOD, your first step should be to contact the GMOD Help Desk or one of the main mailing lists like gmod-announce to make sure that your understanding of the available GMOD resources is correct. When you're in touch with some knowledgeable person, try to get a sense of what the solution might be, or its degree of difficulty. The solution may entail something simple, or a project may have to be created, complete with partnerships and grants. GMOD participants are always interested in seeing GMOD take off in new directions.


Is It Just for Model Organisms?

GMOD stands for Generic Model Organism Database; it was named back in the days when there were a handful of model organisms and it appeared that obtaining the genomic sequence of an organism was a prohibitively expensive proposition, taking months or years to accomplish. With the ease and speed at which genomes can now be sequenced, few scientists would consider their organism a 'model' in this early sense of the word, so we suggest users think of the M as standing for Myriad or My.

Any organism with any kind of sequence associated with it is a good candidate as a subject for a GMOD database. There are GMOD databases with just protein sequence in them like the S. cerevisiae Proteome Browser. There are GMOD databases with EST sequence only, such as the Cattle EST Gene Family Database. There are GMOD databases that are concerned primarily with gene expression, such as the Emiliania huxleyi Serial Analysis of Gene Expression database. We even find GMOD databases dedicated to collections of RNA sequence like the Leishmania tarentolae RNA Editing database. We have also heard of GMOD databases for oligonucleotides and plasmids. See GMOD Users for a list of other examples. The list of GMOD databases demonstrates that GMOD is widely used, with many organisms represented, and that these databases can hold sequences of any kind.


Technologies

Mini-arch-diagram.png

Most GMOD installations have a general architecture in common. There is a data source: the database. This does not have to be a relational database; it could be a file or a set of files with or without some kind of index. There's a lot of flexibility at the data level. Choosing this database and loading it will be tasks you'll give a lot of thought to.


What the user will see is one or more applications. These may be a set of web pages or a Java application. The choice of applications is dictated by the nature of your data. Sometimes the choice of application is easy or clear for a given kind of data. For some other data types you'll have to take a careful look at a few different applications and consider whether you want to invest more resources in order to create complex data representations or whether you want to expend less effort and offer something simpler.


There will also be software mediating the flow of information between the application and the database. Typically this is going to be a Web server: the Web server receives requests from the application and translates them to database queries, then receives data, formats it, and sends it back to the application. This piece of software can be thought of as performing routine or mechanical tasks. It's important but you'll install it and typically pay little attention to it.
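The three layers described above can be sketched in a few lines of Python. This is only an illustration of the request, query, and response flow, not how any GMOD component is implemented: the `feature` table, its columns, and the `chrom=...` request parameter are all hypothetical, and an in-memory SQLite database stands in for the data source.

```python
import json
import sqlite3
from urllib.parse import parse_qs


def build_demo_database():
    """Create a tiny in-memory data source with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [
            ("geneA", "chr7", 100, 900),
            ("geneB", "chr7", 1500, 2400),
            ("geneC", "chr1", 50, 700),
        ],
    )
    return conn


def handle_request(conn, query_string):
    """Play the middle layer: request in, database query, formatted data out."""
    params = parse_qs(query_string)
    chrom = params.get("chrom", [""])[0]
    rows = conn.execute(
        "SELECT name, start_pos, end_pos FROM feature WHERE chrom = ? ORDER BY start_pos",
        (chrom,),
    ).fetchall()
    # Format the raw rows as JSON, which the application would then display.
    return json.dumps([{"name": n, "start": s, "end": e} for n, s, e in rows])


if __name__ == "__main__":
    print(handle_request(build_demo_database(), "chrom=chr7"))
```

In a real installation the Web server and the GMOD application take care of this step; the sketch only shows why the middle layer can be treated as routine plumbing.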

The Components of GMOD

GMOD is made up of databases, applications, and adaptor software that connects these components together. Some of the most popular packages are discussed below.


What is GBrowse?

GBrowse is short for Genome Browser, or Generic Genome Browser. GBrowse is probably GMOD's most popular component and almost all of the databases listed in GMOD Users use GBrowse. It is fairly easy to install, requiring only basic command-line familiarity. GBrowse is popular because it is a supremely capable browser. The picture below is a partial screenshot of a GBrowse page taken from the Human chromosome 7 database at TCAG. A bit of jargon: the rows, each depicting one sort of data, are called tracks, and tracks are populated by one or more images called glyphs, plus text.


Gbrowse-glyphs.png


GBrowse comes with a large library of glyphs, including pie charts, dot plots, histograms, and X-Y plots suitable for quantitative data, as well as the expected array of glyphs that describe sequences and sequence annotation. It is also highly configurable, meaning you can do quite a bit of customization of the glyphs, you can link glyphs to URLs of your choice, you can internationalize the application to display different languages, you can connect and retrieve data from any database, and more. This sort of work generally requires either modifying GBrowse's configuration files or adding your own code. GBrowse is written in Perl; as GBrowse is designed to be customized, extending its functionality with your own code should not require expert coding skill.


JBrowse

JBrowse is a genome browser with a fully dynamic HTML5 user interface, being developed as the successor to GBrowse. It is very fast and scales well to large datasets. JBrowse is JavaScript-based and does almost all of its work directly in the user's web browser, with minimal requirements for the server. JBrowse's features include:

  • Fast, smooth scrolling and zooming. Explore your genome with unparalleled speed.
  • Scales easily to multi-gigabase genomes and deep-coverage sequencing.
  • Supports GFF3, BED, FASTA, Wiggle, BigWig, BAM, VCF (with tabix), REST, and more. BAM, BigWig, and VCF data are displayed directly from the compressed binary file with no conversion needed.
  • Very light server resource requirements. JBrowse has no back-end server code, just tools for formatting data files to be read directly over HTTP. Serve huge datasets from a single low-cost cloud instance.
JBrowse alignment and coverage

Screenshot of JBrowse in action

Relational Databases

For those unfamiliar with databases, the brief guide to databases provides a gentle introduction.

Relational databases are today's tool of choice when faced with the problem of storing complex or multifaceted data, assuming that the data is, or can be, broken down into ever smaller bits of data. All atomized data will end up in one field, analogous to the way that data can be organized as columns in a spreadsheet. Fields describing aspects of a "thing" (an entity) are organized together into tables. For example, a relational database may have a table called gene with gene.name and gene.geneid fields and a protein table with protein.name, protein.proteinid and protein.sequence fields.


Table-example.png


The picture above shows these two tables, and explains the term relational. The relation between the tables is the shared geneid - we add the geneid field to the protein table to indicate that the CFTR_1 protein record relates back to a specific gene in the gene table. This geneid field in protein, which originates in gene and whose values are stored in gene, is an example of a foreign key, a field from a table that is shared by one or more other tables.
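As a minimal sketch of the two tables and the foreign key that links them, the following Python snippet builds them in an in-memory SQLite database. The table and field names follow the example in the text; the choice of SQLite, the column types, and the placeholder protein sequence are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE gene (
        geneid INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    CREATE TABLE protein (
        proteinid INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        sequence  TEXT,
        geneid    INTEGER NOT NULL REFERENCES gene (geneid)  -- the foreign key
    );
    """
)

conn.execute("INSERT INTO gene (geneid, name) VALUES (1, 'CFTR')")
# 'MAAAK' is a placeholder, not the real CFTR protein sequence.
conn.execute(
    "INSERT INTO protein (name, sequence, geneid) VALUES ('CFTR_1', 'MAAAK', 1)"
)

# Follow the relation from each protein back to its gene.
rows = conn.execute(
    "SELECT protein.name, gene.name FROM protein "
    "JOIN gene ON protein.geneid = gene.geneid"
).fetchall()


def orphan_is_rejected(connection):
    """Return True if a protein pointing at a non-existent gene is refused."""
    try:
        connection.execute(
            "INSERT INTO protein (name, sequence, geneid) VALUES ('orphan', 'M', 99)"
        )
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    print(rows)
    print(orphan_is_rejected(conn))
```

The second check shows what the relation buys you: the database itself refuses a protein record whose geneid does not exist in the gene table.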

For a given collection of data, genomic sequence and annotation for example, there will be more than one way to represent the data relationally. A given relational design, essentially tables and fields, is called a schema (think of the schema as a blueprint, empty, and the schema populated with data as the database). Both Chado and BioSQL can store genomic data for example, but they do it differently. The details of schema design are not relevant here but one can say that the designer may think about some of these general concerns:

  • The degree of data abstraction, which is related to a database concept called normalization and to the flexibility of the schema
  • The legibility of the schema, which has to do with the ease of using it
  • The breadth of the schema, in terms of the data types it could contain

From the scientific perspective one can ask related questions:

  • How flexible is a given schema?
  • Can it handle my data now and in the future?
  • Will using a given schema be easier or harder to use than some other schema?

This last question relates mostly to the degree of abstraction of the schema, not to the actual programming languages used.

All of today's relational databases are created and loaded and queried using one language, SQL; programmers use their chosen language (Perl, Java, Python, etc.) to execute SQL queries and process the results.


See also:

Chado and BioSQL

So when you choose to use a relational schema it will all really come down to you and your data, not technical details. Chado is one of the relational schemas used in GMOD, the other being BioSQL.

The differences are clear. BioSQL is quite focussed and is concerned with:

  • Sequence
  • Sequence annotation
  • Phylogeny
  • Publications

It is also a thoroughly modern schema in that it uses OBO-style ontologies, such as the Gene Ontology (GO). This is a requirement, given the ubiquity of ontologies and their ability to describe and organize data.

Chado's focus is broader. Its tables are broken down into groups called modules; the modules are the following:


It is also possible to add modules to Chado. For instance, in early 2007 a module called mage was added to address microarray data. Other possibilities that are being discussed are modules for ecological data and additional work for phenotypic data, extending the existing phenotype module. The real point is that Chado has been designed to allow extensibility, and you can either formally propose that Chado acquire some new functionality as a module or add tables to Chado in the privacy of your own server.

Chado is also ontology-aware. One could state this even more forcefully: Chado depends on ontologies. For example, in Chado's Sequence module it is expected that all stored sequences are identified by one or more terms from the Sequence Ontology. A quick scan of the tables in Chado, more than 100, shows that about half of the tables contain the field, foreign key, cvterm, referring to an ontology term. The ontology used as source for a term could be one of many but people in the field tend to rely on OBO ontologies. So the ontology could be a common and general one like GO, the Gene Ontology, or something highly specific to a group of organisms like the Drosophila Anatomy ontology or the Mammalian Phenotype ontology. In conjunction with Chado, these ontologies give you a database that is extremely flexible, and as your ontologies expand, so does the expressive capability of the system.
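To make the idea of an ontology-typed record concrete, here is a heavily simplified sketch in Python with SQLite. The real Chado `feature` and `cvterm` tables have many more columns and constraints; only the link from a feature's `type_id` to a `cvterm` row is kept here, and the feature names are invented for the example. The term names gene, mRNA, and exon are Sequence Ontology terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for Chado's cvterm and feature tables.
    CREATE TABLE cvterm (
        cvterm_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL,
        type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
    );
    """
)

conn.executemany(
    "INSERT INTO cvterm (cvterm_id, name) VALUES (?, ?)",
    [(1, "gene"), (2, "mRNA"), (3, "exon")],
)
# Hypothetical features; each one is typed by pointing at an ontology term.
conn.executemany(
    "INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
    [("demo_gene_1", 1), ("demo_mrna_1", 2), ("demo_exon_1", 3), ("demo_exon_2", 3)],
)


def features_of_type(term_name):
    """List the features whose type is the given ontology term."""
    cursor = conn.execute(
        "SELECT feature.uniquename FROM feature "
        "JOIN cvterm ON feature.type_id = cvterm.cvterm_id "
        "WHERE cvterm.name = ? ORDER BY feature.uniquename",
        (term_name,),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(features_of_type("exon"))
```

Because the type lives in the ontology table rather than in the table design, storing a new kind of feature means adding a term, not changing the schema.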

There is a cost to this flexibility and breadth: Chado is complex and it is unlikely that someone unfamiliar with Chado would be able to install it and then immediately set about loading it with biological data of different sorts. Fortunately there are mailing lists you can contact, as well as the GMOD Help Desk, and a number of pages on this Wiki discussing Chado (see Getting Started with Chado and the Chado Manual).

GFF Databases

In addition to relational database schemas like Chado and BioSQL you will also encounter GFF databases. GFF is a compact format for describing sequence and sequence annotations. GMOD installations like the Human chromosome 7 database described above are concerned solely with sequence and annotation and the entire contents of such a database can be represented as GFF. For small installations the entire database can be just a set of GFF text files (in fact, you can install GBrowse on your personal computer and then browse Saccharomyces and Volvox genomic sequence, reading directly from GFF files installed along with GBrowse - try it!). But when the amount of GFF gets too large to be read into memory all at once you have to store the GFF in some form that's indexed for fast retrieval. The solution is to load the GFF into MySQL or some other sort of database management system; this ensures good performance even if you have very large amounts of data in GFF format. This is accomplished by using the Bio::DB::GFF or Bio::DB::SeqFeature GBrowse Adaptors.
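For readers who have not seen GFF, the sketch below parses a few GFF3 lines in Python and groups them by sequence, which is the simplest possible form of the indexing described above. It assumes the GFF3 flavor of the format, with nine tab-separated columns; the sequence names, identifiers, and coordinates are made up, and a real installation would rely on the BioPerl adaptors rather than on code like this.

```python
from collections import defaultdict
from urllib.parse import unquote

# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Turn one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


def index_by_seqid(lines):
    """Group features by sequence so one chromosome can be fetched quickly."""
    index = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        record = parse_gff3_line(line)
        index[record["seqid"]].append(record)
    return index


# Invented demo data, not taken from any real genome.
DEMO_LINES = [
    "##gff-version 3",
    "chrI\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demoA",
    "chrI\tdemo\texon\t1000\t1200\t.\t+\t.\tID=exon0001;Parent=gene0001",
    "chrII\tdemo\tgene\t500\t900\t.\t-\t.\tID=gene0002;Name=demoB",
]

if __name__ == "__main__":
    for seqid, records in index_by_seqid(DEMO_LINES).items():
        print(seqid, [r["attributes"]["ID"] for r in records])
```

An in-memory dictionary like this works for small files only; once the data outgrows memory, a database management system provides the same lookup on disk.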

What are WebApollo and Apollo?

Unlike JBrowse and GBrowse, which function only as sequence browsers, WebApollo and its standalone predecessor, Apollo, are for both viewing and manually annotating genomes. WebApollo is a plugin for JBrowse that allows multiple users to annotate genomes concurrently. Changes made by others are automatically and immediately updated in the user's browser window, ensuring that there is no duplication of effort, and allowing several users to annotate parts of the same sequence at the same time. A full history of all edits is kept, and the changes made in editing sessions can be approved or rejected by an administrator before being saved. WebApollo shares JBrowse's fast, flexible browsing interface, and users require only a web browser to use it.

Apollo is a standalone Java application for manual sequence annotation, and is the predecessor of WebApollo. Apollo can read and write to Chado databases, but lacks the instant updates that WebApollo features. We recommend using WebApollo as it is under active development and has a fuller feature set than Apollo.

What are MAKER and DIYA?

GBrowse and Apollo both deal with genome annotations, but where do these annotations come from? Frequently they come from a genome annotation pipeline, a software package or series of software packages that takes an assembly (and other things) as input and produces an annotated genome, often with gene models, ESTs, proteins, and almost anything else that can be tied back to a genomic sequence.

MAKER is a genome annotation pipeline that produces annotated eukaryotic genomes, and DIYA is a genome annotation pipeline for prokaryotic genomes (and both do more than that too). They both produce gene models in GFF, a file format that can be directly loaded into GBrowse, Apollo, and Chado.

What is Pathway Tools?

Pathway Tools is a software system for creating organism-specific databases. It contains extensive functionality that spans from genomes to pathways including a genome browser, metabolic pathway predictor and viewer, and regulatory network viewer, as well as a large number of interactive annotation tools.

What is CMap?

CMap is a popular comparative map viewer. It was initially created for use at Gramene but was redesigned to be used for any organism or set of organisms. It can display genetic maps or physical maps and draw the relations between the two. It will also show synteny. It is written in Perl and requires an underlying RDBMS such as MySQL. If you need to display maps or syntenic relationships, you may need more than GBrowse.


And SynView? or Sybil? or GBrowse_Syn?

Yes, there are other comparative genomics viewers. The alternatives to CMap are GBrowse_syn, Sybil, and SynView. Sybil stores its data in Chado and accommodates a variety of different analyses; go to the Sybil Web site if you want to learn more. GBrowse_syn and SynView build upon GBrowse, and they can be considered a bit simpler than Sybil and CMap. More information is available on their websites to help you determine which is most suitable for you.


What is Tripal?

Tripal is a web frontend for a Chado database that provides both an attractive, slick website for accessing and disseminating Chado data, and an interface for local users to upload and edit data in the database. Tripal is based on the popular content management system Drupal, and creates a customisable website from the Chado database. Tripal includes a number of analysis modules that allow the incorporation of external data (for example, Gene Ontology annotations), and tools such as GBrowse, Galaxy, and CMap can be integrated into the site. Tripal is very customisable and, as it is based on Drupal, extra website content can easily be added using the standard Drupal functionality. For groups looking to avoid the substantial investment of time and effort involved in creating a website for data display and dissemination, Tripal offers a simple, stylish solution.

What is Modware?

Modware is a middleware package used in GMOD, written in Perl. Middleware is software that mediates the exchange of information between applications, e.g. between web pages and databases; please see the GMOD Middleware page for technical details. GMOD developers have evaluated a number of Perl middleware packages and decided that Modware is best suited to GMOD Perl development. Like BioPerl, Modware is a package that you may need to install but won't need to understand in any detail.


What is BioPerl?

BioPerl is a popular bioinformatics toolkit written in Perl. We mention it here because many of the GMOD Components use parts of it. You will not have to learn BioPerl in order to use GMOD but you may have to install it.

BioPerl offers some attractive ways to store genomic data without requiring a relational database. We discussed Chado and BioSQL above; these two relational schemas require the prior installation of some free, open source RDBMS like MySQL or PostgreSQL. Installing these pieces, schema plus RDBMS, is not necessarily difficult, but if you only have sequence and sequence annotation, you can set up a sequence or genome browser using BioPerl, GBrowse, and an Apache web server. You can use either the Bio::DB::GFF module from BioPerl or the Bio::DB::SeqFeature module. See A Simple Sequence Browser below.

And What Else is in GMOD?

GMOD includes a number of other software packages, listed below and classified by general function. One might be tempted to think of this as a shopping list, choosing one of each. But it may also be useful to think of what is absolutely essential first and consider these other components as add-ons. Some of these components are only loosely coupled to some of the more core components described above; an application might use its own methods to store data and not use Chado. Or, a component may be written in Java and not Perl, so it would not be able to communicate with a Perl application. For something to be considered a GMOD component it does not, at this time, have to connect to some other component.


Community Annotation

Apollo

BioDIG

Canto

WebApollo

Wiki TableEdit


Comparative Genome Visualization

CMap

GBrowse_syn

Pathway Tools

SynView

Sybil


Database schema

Chado


Database tools

Argos

BioMart

Genome grid

GMODTools

InterMine

LuceGene

XORT


Gene Expression Visualization

Caryoscope

GeneXplorer

Pathway Tools


Genome Annotation

Apollo

DIYA

MAKER

SOBA

WebApollo


Genome Visualization & Editing

Apollo

Flash GViewer

GBrowse

JBrowse

Pathway Tools

WebGBrowse

WebApollo


Literature and Curation Tools

BioDIG

Canto

Textpresso

Molecular Pathway Visualization

Pathway Tools


Ontology Visualization

Go Graphic Viewer


Workflow Management

Galaxy

Ergatis

DIYA

ISGA


Middleware

Modware

Chado::AutoDBI

Bio::Chado::Schema


Tool Integration

Galaxy


Sequence Alignment

Blast Graphic


Website front end for Chado DB

Tripal

Case Studies


What we are attempting to do here is anticipate some of the basic requirements of the scientist. The classic situation is that he or she has data of some type, or of many different types, and needs to set up both a data repository and a viewer on this data. We are assuming that the scientist is not a programmer or an IT expert but that he or she is willing to learn the necessary skills or has a student available to do the required work.

A Simple Sequence Browser

  • The data: sequence (genomic DNA or ESTs or proteins or cDNAs or some combination of these or ...)
  • The goal: create a browser to query and view sequence and sequence annotations
  • The core software: GBrowse, Apache Web server, and BioPerl
  • The hardware: a server running Unix (Linux or Mac) or Windows
  1. Figure out what the annotations should be (e.g. gene coordinates, motif matches, oligonucleotide matches). You can try using annotation pipelines like MAKER to automatically build these.
  2. Install core software
  3. Create or gather the annotations (BLAST results or HMMER results or GenBank files or ...)
  4. Transform all the annotations into a format suitable for loading (GFF format)
  5. Load GFF into the GFF database
  6. Configure GBrowse

Possible challenge: Step 4, converting all the annotations to GFF (scripts may be available to perform all the conversions, or you may have to write some of the conversion code yourselves)

  • Skills needed: basic command-line competence, perhaps basic Perl competence if you have to write any custom conversion code
  • Resources available: documentation at GMOD.org, the GMOD Help Desk, the GMOD Mailing Lists
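To give a sense of what the conversion code in Step 4 might look like, here is a small Python sketch that turns one line of tabular BLAST output into a GFF3 line. It is an illustration under several assumptions, not a GMOD script: it expects the standard 12-column tabular BLAST format, it assumes the genome was the BLAST database (so the subject sequence becomes the GFF landmark), and the source name "blast", the feature type "match", and the ID scheme are choices made for the example. Perl would serve equally well for this kind of task.

```python
def blast_hit_to_gff3(line, hit_number):
    """Convert one 12-column tabular BLAST hit into a GFF3 feature line.

    Assumes the subject (column 2) is the genomic sequence and the
    query (column 1) is the sequence aligned to it, e.g. an EST.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, found {len(cols)}")
    query_id, subject_id = cols[0], cols[1]
    q_start, q_end, s_start, s_end = (int(c) for c in cols[6:10])
    evalue = cols[10]

    # BLAST reports minus-strand hits with subject start > subject end;
    # GFF3 wants start <= end plus an explicit strand column.
    strand = "+" if s_start <= s_end else "-"
    start, end = sorted((s_start, s_end))

    # Assumes query_id contains no spaces or other characters needing escapes.
    attributes = f"ID=blast_hit_{hit_number};Target={query_id} {q_start} {q_end}"
    return "\t".join(
        [subject_id, "blast", "match", str(start), str(end),
         evalue, strand, ".", attributes]
    )


# Invented example hit: an EST aligned to the minus strand of chr7.
DEMO_HIT = "est42\tchr7\t98.50\t200\t3\t0\t1\t200\t5200\t5001\t1e-95\t350"

if __name__ == "__main__":
    print(blast_hit_to_gff3(DEMO_HIT, 1))
```

Each annotation source (HMMER, GenBank files, and so on) needs its own small converter of this kind, which is why Step 4 is flagged as the possible challenge.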

Recommendation

Highly recommended. Setting this up will give you a good sense of how the software pieces interoperate. Not only that, but GBrowse is fun and it comes with sample databases so once it's installed you have actual genome sequence to play with. You can even get GBrowse running nicely on a laptop.


A Simple Sequence Browser plus a Sequence Annotator

  • The data: sequence (genomic DNA or ESTs or cDNAs or some combination of these or ...)
  • The goal: create a browser to query and view sequence and sequence annotations along with an editor to manually annotate the sequences
  • The core software: GBrowse, Apollo, Chado (plus relational database), Apache Web server, and BioPerl
  • The hardware: a server running Unix (Linux or Mac) or Windows
  1. Figure out what the annotations should be (gene coordinates or motif matches or oligonucleotide matches or hand-made annotations or some combination of these or ...)
  2. Install core software
  3. Create or gather the annotations (BLAST results or HMMER results or GenBank files or ...)
  4. Transform all the annotations into a format suitable for loading (GFF format)
  5. Load GFF into the Chado database
  6. Install and configure GBrowse
  7. Install and configure Apollo

A challenge: Step 2, installing core software (with more components you have a more complex system and more potential pitfalls, and Chado and its relational database is a fairly detailed install)

Possible challenge: Step 4, converting all the annotations to GFF (scripts may be available to perform all the conversions, or you may have to write some of the conversion code yourselves)

Skills needed: basic command-line competence, perhaps basic Perl competence if you have to write any custom conversion code. Some understanding of relational databases for the Chado installation. Basic Java competence for the Apollo installation.

Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists

Recommendation

If you're a GMOD novice, install GBrowse by itself first (A Simple Sequence Browser), then consider this system.


A Browser for a Stock Collection

  • The data: the stock collection data in some structured form (Excel or Word or ...)
  • The goal: create a browser to query and view your laboratory's stock collection
  • The core software: Chado (and its relational database), Apache Web server, and Turnkey
  • The hardware: a server running the Unix (Linux or Mac) or Windows operating system.
  1. Install core software
  2. Load stock collection data into the Chado database
  3. Create a Tripal-based Web site

Challenges:

  • Possible challenge: Step 1, installing core software (Chado and its relational database is a fairly detailed install)
  • A challenge: Step 2, loading the stock collection data into Chado (scripts will not be available to perform this loading; you will have to create the code yourselves)
  • Possible challenge: Step 2, loading the data. The Chado schema may not be properly configured for your data and may need to be modified.
  • Possible challenge: Step 3, running Turnkey to automatically create your browser. Turnkey is a new tool. It has been used successfully in testing and at ParameciumDB but not all possibilities have been tested.

Skills needed: General IT expertise (Turnkey automates the creation of Web sites but it is an expert's tool). Basic programming competence to write the custom conversion code.

Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD Mailing Lists.

Recommendation

Consider whether you want to explore uncharted territory or not. Could be fairly straightforward for the expert, or could be challenging.

A Browser for Microarray Data

  • The data: microarray data in Affymetrix format
  • The goal: create a browser to query and view your laboratory's microarray data
  • The core software: Chado, Apache Web server, and ...
  • The hardware: a server running Unix (Linux or Mac) or Windows

Challenge: Chado can hold the microarray data using its Mage module and applications exist to view raw microarray data (e.g. Caryoscope, GeneXplorer) but these applications don't connect to Chado.

Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists

Recommendation

Either wait for the connectors to be built to some application or form a partnership with GMOD scientists and developers to see that the connectors are built.

A Browser for Map Data

  • The data: map data (genetic map data or physical map data or visual map data or some combination of these)
  • The goal: create a browser to query and view your maps, within a species or across species
  • The core software: GBrowse, Apache Web server, and CMap or SynView or Sybil.
  • The hardware: a server running Unix (Linux or Mac) or Windows
  1. Choose the right map software, based on your map data and resources.
  2. Install core software.
  3. Load map data.

Possible challenge: Step 2, the installation. This may be tricky if you choose one of the more fully featured packages (CMap or Sybil).

Possible challenge: Step 3, the loading. It is likely that some custom coding would be required since map data comes in all sorts of different forms.

Skills needed: Basic command-line competence. Some understanding of relational databases for CMap or Sybil. Basic programming competence to write the custom loading code.

Resources available: documentation at www.gmod.org, the GMOD Help Desk, the GMOD mailing lists

Recommendation

Choose one. GMOD offers good choices here; it comes down to your data and your resources. SynView is the easiest, and it comes with GBrowse.


Computing

Personnel, Hardware and Operating System

Computing Requirements discusses the personnel, hardware, and operating system requirements and choices for implementing GMOD components.


Software

GMOD software relies on other software to function. This section lists some other key open source packages that you may need.


Databases

The Relational Databases section above introduced many relational database concepts. Databases and GMOD discusses database management system choices in GMOD. It also introduces some additional terminology.


Programming Languages

Two programming languages are popular in GMOD: Perl and Java. For most tasks you won't need to do any programming in either language. You will just need to know how to install these languages and how to install programs written in these languages. See Computing Requirements for more.

Perl

Perl is the programming language most used in the bioinformatics realm, and also the language most used by GMOD developers. It is well-suited to text and data processing and is also characterized by an extensive open source library, so it's highly functional. Many GMOD components use BioPerl, a bioinformatics toolkit written in Perl.

Some pieces of GMOD, like GBrowse, can be extended or customized using Perl but beginners' skills in Perl would be sufficient for this work. Just installing and using GBrowse in a conventional way does not require knowledge of Perl or BioPerl.

Java

Java is arguably the world's most popular programming language but it is not as popular for command-line work on Unix as Perl. It's encountered in GMOD primarily as a language to construct user interfaces (e.g. Apollo).

Apache, the Web Server

If you want to set up an application that displays web pages, you will need a web server on your computer. If you don't already have one installed, you will want to use the Apache Web server (also known as the Apache HTTP Server), which is free, fast, secure, and reasonably simple to install on Unix or Windows.


Glossary

The GMOD Glossary explains many terms related to GMOD, bioinformatics, and the computing technologies used in GMOD.


Licenses

Most GMOD Components have no restrictions on using them. Those few components that do impose restrictions will clearly state that they have restricted licenses.