General purpose tables are housed in the module general. The primary purpose of this module is to provide a means of providing data entities with stable, unique identifiers. In Chado, all identifiable data entities have bipartite identifiers, consisting of a dbname plus an accession, together with an optional version suffix.
By convention, these are normally presented using a ’:’ separator. An example of an identifier in this notation would be GO:0008045 or FlyBase:FBgn00000001. In the Chado schema the atomic units are the dbname and the accession, the separator is introduced only in the presentation layer. Each dbname uniquely identifies the authority responsible for a particular ID-space (so there cannot be two GO in any single Chado instance). The accession must be unique within the ID-space. Thus there can be two accessions 0008045, but there can only be one data artefact identified as GO:0008045.
These uniqueness constraints are encoded in the schema, so it is impossible for any Chado relational database instance to violate them.
Each identifier is stored as a row in the dbxref table, with the dbname stored in the db table. Keeping the dbname in a separate db table ensures that the Chado schema retains its commitment to normalization. Entries in other tables can refer to entries in the dbxref table by means of foreign keys.
Note that all stable identifiers are stored in the dbxref table, whether or not they refer to ’external’ data entities. Chado does not have an explicit notion of a data entity being external. Some dbxrefs have further information fully fleshed out in other tables in the database, and others are ’dangling’ dbxrefs.
The two main tables are dbxref (for the identifier itself) and db (for the name of the DB or ID- granting authority). Giving the database authority its own table, rather than repeatedly duplicating the name in the dbxref table, retains data normalization.
A dbxref identifier has two key parts: a db id column that refers to an entry in the db table, and an accession column, that must be a locally unique identifier within the db referred to by the db id column. An optional third column is the version column. Taken together, these 3 columns constitute a unique key.
The db is a database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form (the db.name, which is unique within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is typically in short mnemonic (but human-friendly) form, and uniquely identifies the DB/authority (enforced by uniqueness constraint). Short human-friendly names are encouraged, although longer names (such as full LSID prefixes) may also be used. The name should be a valid XML NMTOKEN (see XML specification for details) - for example, it should not start with a number. This constraint is to help syntactic interoperability with other identifier schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two GOs.
See the following for background:
Basically, a URI is an addressing scheme. The form of URI most people are familiar with are URLs; but not all URIs are URLs. Another URI addressing scheme is the URN; for example, LSIDs use URNs.
People often expect URLs to be resolvable using standard technology (e.g. a web browser) to a resource intended for humans, but this isn’t always the case. URNs may require other software to resolve them; e.g. an LSID resolver.
This column is nullable, so it is possible to defer decision on what the unique URI for a particular authority is if this information is not known up-front. See below for mechanisms for assigning URIs to DBs and ensuring uniqueness.
Note that it is perfectly acceptable for the db.name column to be the same as the url column (provided it is a valid URI). However, it is encouraged that a short form is used as the db.name.
A W3C compliant URL with the address of a website containing information about the DB or authority. For example, http://www.flybase.org, http://www.geneontology.org. The URL is intended for humans rather than software agents.
A W3C compliant URI that contains a unique namespace for the DB/authority. Some ID schemes (e.g. LSID) require this. The URI is intended for software agents rather than humans. It does not need to be a resolvable URL. However, certain DBs may prefer the URI to be a resolvable URL that has human-readable information on the other end. Other DBs may provide URNs (e.g. LSID URNs) that require software agents to be resolved.
The db table probably should have columns for both URL and URI. The former is intended just to go to a website like the FlyBase or GO home page. The latter is intended as a globally unique addressing scheme that should be interoperable with other schemes. For example GO may be a unique identifier for the Gene Ontology ID space by fiat within the bioinformatics community, but not outside. Although Chado only cares about the former, it may have to interoperate with schemes that care about truly global uniqueness, hence URIs.
The db table contains one row per database authority, that is, one row per curator/creator of bioinformatic data collections. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.
F-Key | Name | Type | Description |
---|---|---|---|
db_id | serial | PRIMARY KEY | |
name | character varying(255) | UNIQUE NOT NULL | |
description | character varying(255) | ||
urlprefix | character varying(255) | ||
url | character varying(255) |
db Structure
Tables referencing this one via Foreign Key Constraints:
The dbxref table contains one row per version per collection of bioinformatic data, one row per Chado “database”. The table provides a unique, global, public, stable identifier that can be used to reference a database version. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>.
F-Key | Name | Type | Description |
---|---|---|---|
dbxref_id | serial | PRIMARY KEY | |
db_id | integer | UNIQUE#1 NOT NULL | |
accession | character varying(255) | UNIQUE#1 NOT NULL The local part of the identifier. Guaranteed by the db authority to be unique for that db. |
|
version | character varying(255) | UNIQUE#1 NOT NULL DEFAULT ''::character varying | |
description | text |
dbxref Structure
Tables referencing this one via Foreign Key Constraints:
F-Key | Name | Type | Description |
---|---|---|---|
project_id | serial | PRIMARY KEY | |
name | character varying(255) | UNIQUE NOT NULL | |
description | character varying(255) | NOT NULL |
project Structure
Tables referencing this one via Foreign Key Constraints:
F-Key | Name | Type | Description |
---|---|---|---|
tableinfo_id | serial | PRIMARY KEY | |
name | character varying(30) | UNIQUE NOT NULL | |
primary_key_column | character varying(30) | ||
is_view | integer | NOT NULL | |
view_on_table_id | integer | ||
superclass_table_id | integer | ||
is_updateable | integer | NOT NULL DEFAULT 1 | |
modification_date | date | NOT NULL DEFAULT now() |
tableinfo Structure
Tables referencing this one via Foreign Key Constraints: