Chado General Module
General purpose tables are housed in the module general. The primary purpose of this module is to provide a means of providing data entities with stable, unique identiﬁers. In Chado, all identiﬁable data entities have bipartite identiﬁers, consisting of a dbname plus an accession, together with an optional version suﬃx.
By convention, these are normally presented using a ’:’ separator. An example of an identiﬁer in this notation would be GO:0008045 or FlyBase:FBgn00000001. In the Chado schema the atomic units are the dbname and the accession, the separator is introduced only in the presentation layer. Each dbname uniquely identiﬁes the authority responsible for a particular ID-space (so there cannot be two GO in any single Chado instance). The accession must be unique within the ID-space. Thus there can be two accessions 0008045, but there can only be one data artefact identiﬁed as GO:0008045.
These uniqueness constraints are encoded in the schema, so it is impossible for any Chado relational database instance to violate them.
Each identiﬁer is stored as a row in the dbxref table, with the dbname stored in the db table. Keeping the dbname in a separate db table ensures that the Chado schema retains its commitment to normalization. Entries in other tables can refer to entries in the dbxref table by means of foreign keys.
Note that all stable identiﬁers are stored in the dbxref table, whether or not they refer to ’external’ data entities. Chado does not have an explicit notion of a data entity being external. Some dbxrefs have further information fully ﬂeshed out in other tables in the database, and others are ’dangling’ dbxrefs.
Primary identiﬁers: ENTITY.dbxref id REFERENCES dbxref(dbxref id)
Secondary identiﬁers: ENTITY DBXREF.dbxref id
The two main tables are dbxref (for the identiﬁer itself) and db (for the name of the DB or ID- granting authority). By separating the db into its own table rather than duplicating the name in the dbxref we retain normalization
A dbxref identiﬁer has two key parts: a db id column that refers to an entry in the db table, and an accession column, that must be a locally unique identiﬁer within the db referred to by the db id column. An optional third column is the version column. Taken together, these 3 columns constitute a unique key.
The db is a database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form (the db.name, which is unique within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is typically in short mnemonic (but human-friendly) form, and uniquely identiﬁes the DB/authority (enforced by uniqueness constraint). Examples include FlyBase, GO, MGI. Short human-friendly names are encouraged, although longer names (such as full LSID preﬁxes) may also be used. The name should be a valid XML NMTOKEN (see XML speciﬁcation for details) - for example, it should not start with a number. This constraint is to help syntactic interoperability with other identiﬁer schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two GOs. See below for more information.
2.4 URLs and URIs
See the following for background:
http://en.wikipedia.org/wiki/Uniform Resource Identiﬁer
Basically, a URI is an addressing scheme. The form of URI most people are familiar with are URLs; but not all URIs are URLs. Another URI addressing scheme is the URN; for example, LSIDs use URNs.
People often expect URLs to be resolvable using standard technology (eg a web browser) to a resource intended for humans, but this isn’t always the case. URNs may require other software to resolve them; eg LSID resolver.
The db table has columns for both URL and URI. The former is intended just to go to a website like the FlyBase or GO home page. The latter is intended as a globally unique addressing scheme that should be interoperable with other schemes. For example GO may be a unique identiﬁer for the Gene Ontology ID space by ﬁat within the bioinformatics community, but not outside. Although Chado only cares about the former, it may have to interoperate with schemes that care about truly
global uniqueness, hence URIs.
This column is nullable, so it is possible to defer decision on what the unique URI for a particular authority is if this information is not known up-front. See below for mechanisms for assigning URIs to DBs and ensuring uniqueness.
2.5 Identiﬁers and interoperability between Chado instances
2.6 Table Deﬁnitions
Table 2.1: tableinfo
Column Datatype Description tableinfo id integer name varchar primary key column varchar is view integer view on table id integer superclass table id integer is updateable integer modiﬁcation date date
A database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form, which is unique within the bioinformatics and biomedical realm.
Table 2.2: db
ColumnDatatype Description db id integer name varchar A (typically short, mnemonic) name of the ID-space, database or ID-granting authority.The db.name uniquely identiﬁes the DB/authority. Examples include FlyBase, GO, MGI. Short human-friendly names are encouraged, although longer names (such as full LSID preﬁxes) may also be used. The name should be a valid XML NMTOKEN (see XML speci- ﬁcation for details) - for example, it should not start with a number. This constraint is to help syntac- tic interoperability with other identiﬁer schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two ”GO”s. See supporting docs for more info description varchar contact id int, urlpreﬁx varchar urlvarchar A W3C compliant URL with the address of a website containing information about the DB/authority. For example, http://www.ﬂybase.org, http://www.geneontology.org. The URL is in- tended for humans rather than software agents urivarchar A W3C compliant URI that contains a unique namespace for the DB/authority. Some ID schemes (eg LSID) require this. The URI is intended for soft- ware agents rather than humans. It does not need to be a resolvable URL. However, certain DBs may pre- fer the URI to be a resolvable URL that has human- readable information on the other end. Other DBs may provide URNs (eg LSID URNs) that require software agents to be resolved. Note that it is per- fectly acceptable for the db.name column to be the same as the URI column (provided it is a valid URI). However, it is encouraged that a short form is used as the db.name. See supporting docs for more infor- mation
A unique, global, public, stable identiﬁer. Not necessarily an eXternal reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identiﬁed with a primary identiﬁer (called dbxref id); a table may also have secondary identiﬁers (in a linking table ¡T¿ dbxref). A dbxref is generally written as ¡DB¿:¡ACCESSION¿ or as ¡DB¿:¡ACCESSION¿.the ID-spa¡VERSION¿.
Table 2.3: dbxref
ColumnDatatype Description dbxref idinteger db id integer accessionvarchar The local part of the identiﬁer. Guaranteed by the db authority to be unique for that db. version varchar description text
Table 2.4: project
ColumnDatatype Description project id integer name varchar description varchar