This introduces the broad topic of databases in GMOD. It introduces some database terminology for those that are new to databases. It also covers how databases are implemented and used in GMOD, and what database management system choices are available.
The term database is mentioned in all sorts of contexts in GMOD. It is even part of the project’s name. Despite its central role in GMOD, the term database is often used to mean four different things. This section distinguishes its various meanings and introduces more precise terms that should be (but probably aren’t) used throughout GMOD.
A database is any organized set of data that is readable by a computer. This may or may not use a formal database schema and may or may not be in a database management system.
A database can be implemented in a relational database management system using a defined database schema. For example, the database behind the FlyBase web site contains data on drosopholids, and uses the Chado schema and the PostgreSQL database management system.
A database can also be in regular files with a well defined format.
Database management systems (DBMSs) are software systems that can manage data. Oracle, MySQL, PostgreSQL, and Sybase are all examples of DBMSs. DBMSs are containers of databases. That is, they are the systems that manage databases, which is distinct from the data that they manage.
A database schema is the design of a particular database, independent of its contents. Chado is an example of a database schema. Designs (like Chado) can be reused across multiple databases.
Web sites that feature a lot of database driven content, such as FlyBase (http://flybase.org) or ParameciumDB (http://paramecium.cgm.cnrs-gif.fr), are often referred to as databases. This is somewhat accurate as there are databases backing the web sites, but it is also misleading. These websites also show information that doesn’t come from their database and they also may not show everything in their databases.
The brief guide to databases defines some basic database concepts, and the Relational Databases section in Overview explains many terms that are useful in understanding relational databases.
There are three main GMOD components that are fundamentally about databases, and several more that help you manage databases or that use (or can use) databases to accomplish their purpose.
GMOD’s database related components are:
Chado is the modular database schema of GMOD. Chado is about organizing your data in a database so that you can manage it and can connect other GMOD components to it (either directly or via data exports). When someone speaks of the GMOD Schema they are speaking about Chado.
BioMart is a data warehouse package tailored for biological data. It takes existing databases (for example, the FlyBase Chado database), transforms them into a data warehouse and then provides a web interface for supporting arbitrary queries against the data.
InterMine also integrates multiple data sources into a single data warehouse. It has a core data model based on the sequence ontology and supports several biological data formats. It is easy to extend the data model and integrate your own data, Java and Perl APIs and an XML format to help import custom data. A web application allows creation of custom queries, includes template queries (web forms to run ‘canned’ queries) and can upload and operate on lists of data. Many aspects of the web app can be configured and branded.
Several GMOD Components rely on databases to store their data. All such components have a default DBMS that the developers had in mind when they created the component. The default DBMS is most often PostgreSQL or MySQL. PostgreSQL, commonly known as Postgres, and MySQL are both open-source DBMSs with large and active user communities. It is possible to use a DBMS other than the default but it does involve more work, sometimes a lot more work.
See the component descriptions to find out if they need an underlaying database and what their default DBMS is.
The answer is yes, but it will mean extra work.
You may want to do this if you are already using a DBMS that you understand. DBMS administration is non-trivial and adding one or two more DBMSs to the list you have to support may or may not be more effort than porting the component to use your DBMS of choice. However, do keep in mind that one of the reasons why MySQL and Postgres are often picked as default DBMSs is that they are comparatively easy to administer.
Postgres and MySQL are the most popular DBMSs, but several others are in use in the GMOD community.
Postgres (officially known as PostgreSQL) is the default DBMS for Chado, GMOD’s modular database schema. Galaxy can also run using Postgres. Work is being done on Ergatis to also support PostgreSQL.
See the PostgreSQL page from more information on Postgres.
MySQL adapters exist for GBrowse and it also used by CMap
Using MySQL with Chado is discouraged. See MySQL for more information on MySQL in GMOD.
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
SQLite is used in a few places in GMOD. MAKER uses it for its data store. It is the default database when installing Galaxy, but most users then upgrade to a different DBMS, usually PostgreSQL. It can also be used with GBrowse.
Xenbase uses DB2 for their Chado installation. DB2 is a high-end database from IBM that has a free version and also a free academic licenses. DB2 is one of the big players in the commercial database market.
Oracle is a supported DBMS for CMap and GBrowse.
ApiDB uses Oracle for its database needs, and dictyBase uses Oracle for Chado and GMODTools.
Work is being done on Ergatis to also support Oracle.
Oracle is a high-end database management system from Oracle Corporation. It is the most popular commercial database in the world.
Due to its heritage at JCVI, the default database of the Ergatis workflow management tool was originally Sybase. Currently, it is usually used with MySQL and PostgreSQL by way of the Coati development framework.
Relational Databases in Overview.