This page or section is under construction.
Should be done before Jan 1, 2011
The Chado Phenotype module was developed at FlyBase.
To try and understand the existing phenotype module during the GMOD Evo Hackathon, participants spent some time exploring the phenotype tables at FlyBase. The notes from that exploration are on this page.
Phenotypes are not cleanly implemented in a single module. The tables are spread across the Phenotype and Genetic modules. This page discusses the tables from both modules. It also covers some related tables.
These notes reflect the state of the FlyBase Chado database during the first two weeks on November 2010.
Table | rows at FlyBase |
---|---|
feature_phenotype |
0 |
phenotype |
10,925 |
phenotype_comparison |
100,235 |
phenotype_cvterm |
7,714 |
phenstatement |
20,673 |
phendesc |
112,390 |
And those are the only tables that have “phen” in their name.
environment
(Doc: “The environmental component of a phenotype description.”)genotype
And that’s it?
phenotype
<protect>
Table: | phenotype |
Module: | Phenotype |
A phenotypic statement, or a single atomic phenotypic observation, is a controlled sentence describing observable effects of non-wild type function. E.g. Obs=eye, attribute=color, cvalue=red.
FK | Name | Type | Description |
---|---|---|---|
phenotype_id | serial | PRIMARY KEY |
|
uniquename | text | UNIQUE NOT NULL |
|
cvterm |
observable_id | integer | The entity: e.g. anatomy_part, biological_process. |
cvterm |
attr_id | integer | Phenotypic attribute (quality, property, attribute, character) - drawn from PATO. |
value | text | Value of attribute - unconstrained free text. Used only if cvalue_id is not appropriate. | |
cvterm |
cvalue_id | integer | Phenotype attribute value (state). |
cvterm |
assay_id | integer | Evidence type. |
phenotype
columns
Tables referencing phenotype
via foreign
key constraints:
</protect>
From conversation with Jim Balhoff and Matt Yoder:
A phenotype is what you observe.
This means that stage or time since conception is not part of the phenotype. It is part of the experiment. (Unless you are observing development rate, in which case, stage would be the phenotype.)
That table description in the schema sounds like the old PATO, not the new EQ model.
Does this table implement PATO at FlyBase?
No
phenotype
@ FlybaseCounts | |
---|---|
Table Rows | 10925 |
observable_id is null | 1 |
attr_id is null | 1 |
cvalue_id is null | 1 |
assay_id is null | 1 |
value is null | 10925 |
All the CVterm FKs that are null have the uniquename of “unspecified”
Most uniquenames are observable_id -> cvterm.name + blank or integer, but not when name is “unspecified”, uniquename can be anything. Some examples:
There are 1473 of these (every one of them unspecified).
SELECT COUNT(*), c.name
FROM phenotype p, cvterm t, cv c
WHERE t.cvterm_id = p.observable_id
AND t.cv_id = c.cv_id
GROUP BY 2
ORDER BY 1 DESC
LIMIT 50;
count | name
------+--------------------------
8914 | FlyBase anatomy CV
1473 | FlyBase miscellaneous CV
436 | cellular_component
91 | biological_process
9 | property type
1 | SO
(6 rows)
attr_id is supposed to be “Phenotypic attribute (quality, property, attribute, character) - drawn from PATO.” but is not.
assay_id is supposed to be “Evidence type” but it is not.
At FlyBase, in all rows (except the null one) both of these point to cvterm “unspecified”
So, this is not used for PATO at Flybase. I’m not sure why it isn’t null.
Schema doc says “Phenotype attribute value (state).”
There are only 118 distinct values. Some of them:
All from: FlyBase miscellaneous CV
value is null in every row in FlyBase.
phenotype_comparison
<protect>
Table: | phenotype_comparison |
Module: | Genetic |
Comparison of phenotypes e.g., genotype1/environment1/phenotype1 “non-suppressible” with respect to genotype2/environment2/phenotype2.
FK | Name | Type | Description |
---|---|---|---|
phenotype_comparison_id | serial | PRIMARY KEY |
|
genotype |
genotype1_id | integer | UNIQUE#1 NOT NULL |
environment |
environment1_id | integer | UNIQUE#1 NOT NULL |
genotype |
genotype2_id | integer | UNIQUE#1 NOT NULL |
environment |
environment2_id | integer | UNIQUE#1 NOT NULL |
phenotype |
phenotype1_id | integer | UNIQUE#1 NOT NULL |
phenotype |
phenotype2_id | integer | |
pub |
pub_id | integer | UNIQUE#1 NOT NULL |
organism |
organism_id | integer | NOT NULL
|
phenotype_comparison
columns
Tables referencing phenotype_comparison
via foreign
key constraints:
</protect>
This table is good for binary comparisons. How broadly useful is that?
phenotype_comparison
@ FlybaseCounts | |
---|---|
Table Rows | 100,235 |
phenotype2_id is null | 0 |
phenotype1_id = phenotype2_id | 42,510 |
genotype1_id = genotype2_id | 0 |
environment1_id = environment2_id | 99,293 |
Every column but phenotype2_id is not nullable, and that column has no nulls in it either.
The above counts imply a lot comparisons using different genotypes in the same environment. No it does not. In every one of those 99K comparisons, environment is “unspecified”.
So, FlyBase has known environments for 1% of its comparisons.
The 42K comparisons where the phenotypes are the same mean that the genotype did not change this phenotype.
type_id is a phantom column. It does not exist in current Chado or at FlyBase.
How do we say how they compare without this column?
What makes two G/E/P pairs comparable?
phenotype_cvterm
<protect>
Table: | phenotype_cvterm |
Module: | Phenotype |
FK | Name | Type | Description |
---|---|---|---|
phenotype_cvterm_id | serial | PRIMARY KEY |
|
phenotype |
phenotype_id | integer | UNIQUE#1 NOT NULL |
cvterm |
cvterm_id | integer | UNIQUE#1 NOT NULL |
rank | integer | UNIQUE#1 NOT NULL
|
phenotype_cvterm
columns
Tables referencing phenotype_cvterm
via foreign
key constraints:
Can imagine it just attaches a bag of terms to the phenotype. There is no semantics on how the terms relate to each other or to the phenotype.
phenotype_cvterm
@ FlybaseCounts | |
---|---|
Table Rows | 7714 |
Flybase’s table def has a rank column, which is > 0 for about 2000 rows. The rank column is part of standard Chado, but not listed on the web site.
There is a unique constraint on (phenotype_id, cvterm_id, rank). Why?
Links to terms in 5 different CVs
Find rows that use rank in the unique constraint.
SELECT * FROM phenotype_cvterm pcv1
WHERE EXISTS (
SELECT * FROM phenotype_cvterm pcv2
WHERE pcv1.phenotype_cvterm_id <> pcv2.phenotype_cvterm_id
AND pcv1.phenotype_id = pcv2.phenotype_id
AND pcv1.cvterm_id = pcv2.cvterm_id) LIMIT 10;
phenotype_cvterm_id | phenotype_id | cvterm_id | rank
--------------------+--------------+-----------+------
8763 | 15842 | 60843 | 0
8764 | 15842 | 60843 | 1
12006 | 19433 | 60843 | 1
12007 | 19433 | 60843 | 2
(4 rows)
So, what does that mean?
Rank is used for some sort of code that is not stored in the database?
phenstatement
<protect>
Table: | phenstatement |
Module: | Genetic |
Phenotypes are things like “larval lethal”. Phenstatements are things like “dpp-1 is recessive larval lethal”. So essentially phenstatement is a linking table expressing the relationship between genotype, environment, and phenotype.
FK | Name | Type | Description |
---|---|---|---|
phenstatement_id | serial | PRIMARY KEY |
|
genotype |
genotype_id | integer | UNIQUE#1 NOT NULL |
environment |
environment_id | integer | UNIQUE#1 NOT NULL |
phenotype |
phenotype_id | integer | UNIQUE#1 NOT NULL |
cvterm |
type_id | integer | UNIQUE#1 NOT NULL |
pub |
pub_id | integer | UNIQUE#1 NOT NULL
|
phenstatement
columns
Tables referencing phenstatement
via foreign
key constraints:
phenstatement
@ FlybaseCounts | |
---|---|
Table Rows | 203673 |
All columns are not null; all except PK are FKs. Therefore phenstatements require
Well, sort of.
There are 72K linked genotypes. They use a custom markup language in the uniquename.
phendesc
<protect>
Table: | phendesc |
Module: | Genetic |
A summary of a _set_ of phenotypic statements for any one gcontext made in any one publication.
FK | Name | Type | Description |
---|---|---|---|
phendesc_id | serial | PRIMARY KEY |
|
genotype |
genotype_id | integer | UNIQUE#1 NOT NULL |
environment |
environment_id | integer | UNIQUE#1 NOT NULL |
description | text | NOT NULL |
|
cvterm |
type_id | integer | UNIQUE#1 NOT NULL |
pub |
pub_id | integer | UNIQUE#1 NOT NULL
|
phendesc
columns
Tables referencing phendesc
via foreign
key constraints:
You can’t link directly from a
phenotype
to phendesc
. You can link directly from a
phenstatement
to a {[ChadoTableName|phendesc}} using (genotype_id, environment_id,
and pub_id). Note that this does not take full advantage of the unique
index on those 3 columns plus type_id.
This table establishes two things:
This table means there can only be one phendesc for each environment/genotype/pub publication. I think that makes sense.
It does mean we wont link to phenstatements with the wrong genotype/environment, but it also means we can orphan phendesc rows that don’t join with any phenstatements.
phendesc
@ FlybaseCounts | |
---|---|
Table Rows | 112,390 |
description is null | 0 |
No columns are nullable. There is description, 4 FKs (genotype_id, environment_id, type_id, pub_id), and the PK.
Description is not unique with “Homozygous lethal.” and “homozygous lethal” making up ~1% of records.
SELECT COUNT(*), t.name, c.name
FROM phendesc p, cvterm t, cv c
WHERE p.type_id = t.cvterm_id
AND t.cv_id = c.cv_id
GROUP BY 2,3
ORDER BY 1 DESC
LIMIT 50;
count | name | name
------+---------------------------+---------------
71688 | single_mutant_pheno | phendesc type
18648 | genetic_interaction_pheno | phendesc type
12809 | aberr_pheno | phendesc type
7662 | interallele_comp | phendesc type
1583 | xeno_interaction_pheno | phendesc type
(5 rows)
11,930 rows (about 10% of rows) link to the pub with no title.
101,746 rows (over 90%) link to “unspecified”
There is wide distribution of genotypes. The top genotype has 122 phendesc records.
Free form description of phenotype. Almost all entries are informative, and none are null.
environment
Definition:
<protect>
Table: | environment |
Module: | Genetic |
The environmental component of a phenotype description.
FK | Name | Type | Description |
---|---|---|---|
environment_id | serial | PRIMARY KEY |
|
uniquename | text | UNIQUE NOT NULL |
|
description | text |
environment
columns
Tables referencing environment
via foreign
key constraints:
</protect>
environment
@ FlybaseCounts | |
---|---|
Table Rows | 15 |
description is null in all rows.
Uniquename choice are:
unspecified
conditional qualifier
temperature conditional
drug conditional
conditional tetracycline
conditional RU486
Drosophila cell culture
In transgenic Drosophila (intraspecific)
In transgenic Drosophila (allele of one drosophilid species in genome of another drosophilid)
In transgenic Drosophila (allele of foreign species in genome of drosophilid)
Whole-organism transient assay (intraspecific)
Whole-organism transient assay (allele from one drosophilid species assayed in another drosophilid)
Whole-organism transient assay (allele of foreign species assayed in drosophilid)
conditional - heat sensitive
conditional - cold sensitive
environment_cvterm
Definition:
<protect>
Table: | environment_cvterm |
Module: | Genetic |
FK | Name | Type | Description |
---|---|---|---|
environment_cvterm_id | serial | PRIMARY KEY |
|
environment |
environment_id | integer | UNIQUE#1 NOT NULL |
cvterm |
cvterm_id | integer | UNIQUE#1 NOT NULL
|
environment_cvterm
columns
Tables referencing environment_cvterm
via foreign
key constraints:
No semantics here. My guess is associates a list of CV terms with an environment.
environment_cvterm
@ FlybaseCounts | |
---|---|
Table Rows | ? |
genotype
Definition:
<protect>
Table: | genotype |
Module: | Genetic |
Genetic context. A genotype is defined by a collection of features, mutations, balancers, deficiencies, haplotype blocks, or engineered constructs.
FK | Name | Type | Description |
---|---|---|---|
genotype_id | serial | PRIMARY KEY |
|
name | text | Optional alternative name for a genotype,
for display purposes. |
|
uniquename | text | UNIQUE NOT NULL |
|
description | character varying(255) |
genotype
columns
Tables referencing genotype
via foreign
key constraints:
</protect>
genotype
@ FlybaseCounts | |
---|---|
Table Rows | 263,153 |
description is null | 263,153 |
name is null | 165,210 |
description and name are null | 165,210 |
There is no genotype_cvterm
table.
Uniquename is the only informative column is 2/3 of the rows. It uses a custom FlyBase encoding of information.
In the cases where this is not null it uses a custom FlyBase nomenclature that often includes one or more FlyBase IDs.
feature_genotype
Definition:
<protect>
Table: | feature_genotype |
Module: | Genetic |
FK | Name | Type | Description |
---|---|---|---|
feature_genotype_id | serial | PRIMARY KEY |
|
feature |
feature_id | integer | UNIQUE#1 NOT NULL |
genotype |
genotype_id | integer | UNIQUE#1 NOT NULL |
feature |
chromosome_id | integer | UNIQUE#1 A feature of SO type "chromosome". |
rank | integer | UNIQUE#1 NOT NULL |
|
cgroup | integer | UNIQUE#1 NOT NULL |
|
cvterm |
cvterm_id | integer | UNIQUE#1 NOT NULL
|
feature_genotype
columns
Tables referencing feature_genotype
via foreign
key constraints:
This is the table that connects data to features, finally. The rank and group have some semantics.
feature_genotype
@ FlybaseCounts | |
---|---|
Table Rows | 551,551 |
chromosome_id is null | 0 |
All cvterm_ids are unspecified. Every one of them. No doc on this.
Spatially distinguishable group. group can be used for distinguishing the chromosomal groups, for example (RNAi products and so on can be treated as different groups, as they do not fall on a particular chromosome).
455K are 0; After that it diminishes up to 6, which only has 5 records.
What does this mean? It’s not an FK to anywhere, just an integer that must be unique in combination with feature_id, genotype_id, chromosome, rank, cgroup, and cvterm_id (every non-PK column).
Every record points to the same feature_id, which has the name “unspecified”. Is this supposed to be a denormalization optimization that they don’t use at FlyBase?
rank can be used for n-ploid organisms or to preserve order.