Chado General Module

From GMOD
Revision as of 17:03, 14 February 2007 by Bosborne (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Introduction

General purpose tables are housed in the module general. The primary purpose of this module is to provide a means of providing data entities with stable, unique identifiers. In Chado, all identifiable data entities have bipartite identifiers, consisting of a dbname plus an accession, together with an optional version suffix.

By convention, these are normally presented using a ’:’ separator. An example of an identifier in this notation would be GO:0008045 or FlyBase:FBgn00000001. In the Chado schema the atomic units are the dbname and the accession, the separator is introduced only in the presentation layer. Each dbname uniquely identifies the authority responsible for a particular ID-space (so there cannot be two GO in any single Chado instance). The accession must be unique within the ID-space. Thus there can be two accessions 0008045, but there can only be one data artefact identified as GO:0008045.

These uniqueness constraints are encoded in the schema, so it is impossible for any Chado relational database instance to violate them.

Each identifier is stored as a row in the dbxref table, with the dbname stored in the db table. Keeping the dbname in a separate db table ensures that the Chado schema retains its commitment to normalization. Entries in other tables can refer to entries in the dbxref table by means of foreign keys.

Note that all stable identifiers are stored in the dbxref table, whether or not they refer to ’external’ data entities. Chado does not have an explicit notion of a data entity being external. Some dbxrefs have further information fully fleshed out in other tables in the database, and others are ’dangling’ dbxrefs.


2.2Design patterns


Primary identifiers: ENTITY.dbxref id REFERENCES dbxref(dbxref id)


Secondary identifiers: ENTITY DBXREF.dbxref id


2.3 Tables


The two main tables are dbxref (for the identifier itself) and db (for the name of the DB or ID- granting authority). By separating the db into its own table rather than duplicating the name in the dbxref we retain normalization

A dbxref identifier has two key parts: a db id column that refers to an entry in the db table, and an accession column, that must be a locally unique identifier within the db referred to by the db id column. An optional third column is the version column. Taken together, these 3 columns constitute a unique key.

The db is a database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form (the db.name, which is unique within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is typically in short mnemonic (but human-friendly) form, and uniquely identifies the DB/authority (enforced by uniqueness constraint). Examples include FlyBase, GO, MGI. Short human-friendly names are encouraged, although longer names (such as full LSID prefixes) may also be used. The name should be a valid XML NMTOKEN (see XML specification for details) - for example, it should not start with a number. This constraint is to help syntactic interoperability with other identifier schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two GOs. See below for more information.


2.4 URLs and URIs


See the following for background:

http://en.wikipedia.org/wiki/Uniform Resource Identifier

http://en.wikipedia.org/wiki/URL

Basically, a URI is an addressing scheme. The form of URI most people are familiar with are URLs; but not all URIs are URLs. Another URI addressing scheme is the URN; for example, LSIDs use URNs.

People often expect URLs to be resolvable using standard technology (eg a web browser) to a resource intended for humans, but this isn’t always the case. URNs may require other software to resolve them; eg LSID resolver.

The db table has columns for both URL and URI. The former is intended just to go to a website like the FlyBase or GO home page. The latter is intended as a globally unique addressing scheme that should be interoperable with other schemes. For example GO may be a unique identifier for the Gene Ontology ID space by fiat within the bioinformatics community, but not outside. Although Chado only cares about the former, it may have to interoperate with schemes that care about truly


global uniqueness, hence URIs.

This column is nullable, so it is possible to defer decision on what the unique URI for a particular authority is if this information is not known up-front. See below for mechanisms for assigning URIs to DBs and ensuring uniqueness.


2.5 Identifiers and interoperability between Chado instances


2.6 Table Definitions


tableinfo


NULL


 Table 2.1: tableinfo

Column Datatype Description tableinfo id integer name varchar primary key column varchar is view integer view on table id integer superclass table id integer is updateable integer modification date date


db


A database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form, which is unique within the bioinformatics and biomedical realm.


Table 2.2: db

ColumnDatatype Description db id integer name varchar A (typically short, mnemonic) name of the ID-space, database or ID-granting authority.The db.name uniquely identifies the DB/authority. Examples include FlyBase, GO, MGI. Short human-friendly names are encouraged, although longer names (such as full LSID prefixes) may also be used. The name should be a valid XML NMTOKEN (see XML speci- fication for details) - for example, it should not start with a number. This constraint is to help syntac- tic interoperability with other identifier schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two ”GO”s. See supporting docs for more info description varchar contact id int, urlprefix varchar urlvarchar A W3C compliant URL with the address of a website containing information about the DB/authority. For example, http://www.flybase.org, http://www.geneontology.org. The URL is in- tended for humans rather than software agents urivarchar A W3C compliant URI that contains a unique namespace for the DB/authority. Some ID schemes (eg LSID) require this. The URI is intended for soft- ware agents rather than humans. It does not need to be a resolvable URL. However, certain DBs may pre- fer the URI to be a resolvable URL that has human- readable information on the other end. Other DBs may provide URNs (eg LSID URNs) that require software agents to be resolved. Note that it is per- fectly acceptable for the db.name column to be the same as the URI column (provided it is a valid URI). However, it is encouraged that a short form is used as the db.name. See supporting docs for more infor- mation


dbxref


A unique, global, public, stable identifier. Not necessarily an eXternal reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref id); a table may also have secondary identifiers (in a linking table ¡T¿ dbxref). A dbxref is generally written as ¡DB¿:¡ACCESSION¿ or as ¡DB¿:¡ACCESSION¿.the ID-spa¡VERSION¿.


 Table 2.3: dbxref
ColumnDatatype Description
dbxref idinteger
db id integer
accessionvarchar  The local part of the identifier. Guaranteed by the
db authority to be unique for that db.
version  varchar
description text


project


 Table 2.4: project
 ColumnDatatype Description
 project id  integer
 name  varchar
 description varchar