GMOD

Sample Chado SQL

Contents

Abstract

This HOWTO provides several sample SQL queries to view basic genome data from a Chado database. Some of these are drawn from the GMODTools configuration file GMODTools/conf/bulkfiles/chadofeatsql.xml Example output of some of these is shown in the tables at http://insects.eugenes.org/genome/Drosophila_melanogaster/current/tables/

PostgreSQL queries

The simplest way to test contents of a Chado database is with the psql command line program that is part of the PostgreSQL software. The following sample SQL code can be used this way from a Unix or MacOSX command line.

 dgbook% psql -l
           List of databases
      Name      |  Owner   | Encoding
 ---------------+----------+-----------
  dev_chado_01b | gilbertd | SQL_ASCII

 dgbook% psql dev_chado_01b
  dev_chado_01b=# select organism_id,count(*) from feature group by organism_id;
  organism_id | count
 -------------+-------
            7 |     8
            6 |    10
            3 |    10
           10 |  1605

organism_summary

This lists organisms and number of features per organism.

   SELECT o.organism_id,o.abbreviation,o.genus,o.species,o.common_name, 
          count(f.feature_id) as n_features, o.comment
   FROM organism o LEFT JOIN feature f USING (organism_id)
   GROUP by o.organism_id,o.abbreviation,o.genus,o.species,o.common_name,o.comment
   ORDER BY o.genus,o.species
   ;

sample result

 Organism_id   Abbreviation    Genus   Species Common_name N_features  Comment
 105    \N      Abrostola   asclepiadis     \N  1   \N
 ..
   1    Dmel    Drosophila  melanogaster    fruit fly   725035  \N
 214    Dpse    Drosophila  pseudoobscura   \N  137045  \N

feature_summary

This lists number of features and sequences by species and type.

 SELECT 
   f.type_id,
   t.name as Feature_type,
   count(f.feature_id) as N_features,
   sum(length(f.residues)) as N_residues,
   sum(f.seqlen) as Tot_len,
   ROUND( AVG(f.seqlen), 0 ) as Ave_len,
   MIN(f.seqlen) as Min_len,
   MAX(f.seqlen) as Max_len,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, cvterm t
 WHERE f.type_id = t.cvterm_id
 GROUP BY f.organism_id, f.type_id, t.name
 ORDER BY species, feature_type
 ;

Sample result

 Type_id   Feature_type    N_features  N_residues  Tot_len Ave_len Min_len Max_len Species
 256    EST 308722  143832868   \N  \N  \N  \N  Drosophila_melanogaster
 562    cDNA    13464   26962334    \N  \N  \N  \N  Drosophila_melanogaster
 210    chromosome_arm  14  246701062   246701062   18977005    1237870 27905053    Drosophila_melanogaster
 257    exon    64498   \N  31208553    484 3   27725   Drosophila_melanogaster
 219    gene    14828   \N  70649778    4899    16  171463  Drosophila_melanogaster
 720    insertion_site  622 \N  28  0   0   1   Drosophila_melanogaster
 368    mRNA    19389   44104476    44104476    2275    132 69571   Drosophila_melanogaster
 450    miRNA   66  1498    1498    23  20  29  Drosophila_melanogaster
 426    ncRNA   116 115179  115179  993 19  14084   Drosophila_melanogaster
 733    point_mutation  1444    \N  43  1   1   1   Drosophila_melanogaster
 499    polyA_site  122 \N  \N  \N  \N  \N  Drosophila_melanogaster
 1179   protein 22219   12111034    10921981    563 25  23015   Drosophila_melanogaster

chromosome_summary

This is an extention of feature summary, to identify source features. These are the chromosomes, scaffolds, contigs or other features on which other (genes, etc.) are located. It is essential for Chado software to know about these source features. If you are working with someone else’s Chado database, this query will tell you explicitly which features contain others.

 SELECT 
   f.type_id,
   t.name as Feature_type,
   count(f.feature_id) as N_features,
   sum(length(f.residues)) as N_residues,
   sum(f.seqlen) as Tot_len,
   sum(CASE WHEN fl.srcfeature_id = f.feature_id THEN 1 ELSE 0 END) as N_issource,
   sum(CASE WHEN fl.feature_id = f.feature_id THEN 1 ELSE 0 END) as N_istarget,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM cvterm t, feature f 
   left join featureloc fl on (fl.srcfeature_id = f.feature_id or fl.feature_id = f.feature_id)
 WHERE f.type_id = t.cvterm_id
 GROUP BY f.organism_id, f.type_id, t.name
 ORDER BY species, feature_type

See the n_issource column for source features, the n_istarget lists the target features contained in sources.

type_id   feature_type  n_features    n_residues      tot_len     n_issource   n_istarget  
     66   centromere            16                          1883           0           16 
     64   chromosome        146644   132193461745   132193461745      146644            0 
    124   gene                6609                       8889442           0         6609 
    133   intron               392                        122291           0          392 
    156   ncRNA                483                         76213           0          483

Note the n_issource count is the number of features contained in the chromosomes, not number of chromosomes (see above feature_summary for that).

analysis_summary

This lists analyses and number of features per analysis.

 SELECT 
   an.analysis_id,
   CASE WHEN (an.sourcename IS NULL OR an.sourcename = 'dummy') THEN 'match:' || an.program
     ELSE 'match:' || an.program || ':' || an.sourcename
   END AS Analysis_type, 
   count(f.feature_id) as N_features,
   ROUND( (AVG(af.rawscore)::numeric), 2 ) as Ave_score,
   ROUND( (AVG(af.significance)::numeric),  2 ) as Ave_sig,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, analysisfeature af, analysis an
 WHERE an.analysis_id = af.analysis_id and af.feature_id = f.feature_id
 GROUP BY f.organism_id, an.analysis_id, Analysis_type  
 ORDER BY species, Analysis_type
 ;

Sample result

 Analysis_id   Analysis_type   N_features  Ave_score   Ave_sig Species
 68 match:aubrey_cytolocator:cytology   5770    \N  \N  Computational_result
 70 match:augustus  60764   \N  \N  Computational_result
 53 match:blastx_masked:aa_SPTR.insect  53629   200.60  \N  Computational_result
 44 match:repeatmasker  23486   3922.55 \N  Computational_result
 78 match:tblastn:Dmel r3.1 12179   \N  \N  Computational_result

sequence_ontology

Specifying the sequence ontology section of the cv and cvterm tables is a small but essential bit of a Chado database that software needs for configuration. Not everyone uses the same name, though Sequence Ontology Feature Annotation is recommended.

 select cv_id,name from cv where cv_id in (select cv_id from cvterm where name = 'exon');
 cv_id        name                 
    8  Sequence Ontology Feature Annotation
    9  Sequence Ontology

property_summary

This lists properties and number of features per analysis.

 SELECT 
   fp.type_id,
   t.name as Property_type,
   count(fp.featureprop_id) as N_properties,
   count(distinct f.feature_id) as N_features,
   count(distinct fp.value) as N_values,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, featureprop fp, cvterm t
 WHERE fp.type_id = t.cvterm_id and fp.feature_id = f.feature_id
 GROUP BY f.organism_id, fp.type_id, t.name
 ORDER BY Species, Property_type
 ;

Sample result

 Type_id   Property_type   N_properties    N_features  N_values    Species
 57127  citation    5467    4073    1141    Drosophila_melanogaster
 59945  comment 13005   9699    11264   Drosophila_melanogaster
     7  description 312199  312196  312171  Drosophila_melanogaster
 59954  dicistronic 109 109 1   Drosophila_melanogaster
 59959  mutant_in_strain    14  14  5   Drosophila_melanogaster
 59951  non_canonical_splice_site   488 488 2   Drosophila_melanogaster
 59953  problem 6350    6350    12  Drosophila_melanogaster

gene_page

This is a sample gene page view to list most attributes for a gene feature. NOTE: this kind of multi-table join view can be very slow to execute on a large genome database.

Usage: dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = ‘PAU1’;

 CREATE OR REPLACE VIEW v_genepage2 
     (feature_id, field, value)
   AS
     SELECT feature_id AS feature_id, 'Name' as field, name as value FROM feature
   UNION ALL
     SELECT feature_id AS feature_id, 'uniquename' as field, uniquename as value FROM feature 
   UNION ALL
     SELECT feature_id AS feature_id, 'seqlen' as field, cast(seqlen as text) as value FROM feature  
   UNION ALL
     SELECT f.feature_id AS feature_id, 'type' as field, c.name as value 
     FROM feature f, cvterm c  WHERE f.type_id = c.cvterm_id 
   UNION ALL
     SELECT f.feature_id AS feature_id, 'organism' as field, o.abbreviation as value  
     FROM feature f, organism o WHERE  f.organism_id = o.organism_id 
   
   UNION ALL
     SELECT fs.feature_id AS feature_id,  
       CASE WHEN fs.is_current IS FALSE THEN 'Synonym_2nd' ELSE 'Synonym' END AS field,
       s.name as value 
     FROM  feature_synonym fs, synonym s 
     WHERE fs.synonym_id = s.synonym_id  
   
   UNION ALL
     SELECT f.feature_id AS feature_id, 'Dbxref' as field, gd.name||':'||gx.accession as value
     FROM   feature f, db gd, dbxref gx
     WHERE  f.dbxref_id = gx.dbxref_id and gx.db_id = gd.db_id  
   
   UNION ALL
     SELECT fs.feature_id AS feature_id,  
       CASE WHEN fs.is_current IS FALSE THEN 'Dbxref obsolete' ELSE 'Dbxref 2' END AS field, 
       (d.name || ':' || s.accession)::text AS value
     FROM  feature_dbxref fs, dbxref s, db d
     WHERE fs.dbxref_id = s.dbxref_id and s.db_id = d.db_id
   
   UNION ALL
     SELECT fc.feature_id AS feature_id, c.name AS field, 
           substr(cv.name,1,40) || '; '|| dx.accession AS value
     FROM  feature_cvterm fc, cvterm cv, cv c, dbxref dx
     WHERE fc.cvterm_id = cv.cvterm_id and cv.cv_id = c.cv_id  
           and cv.dbxref_id = dx.dbxref_id
   
   UNION ALL
     SELECT fp.feature_id AS feature_id, cv.name AS field, fp.value AS value
     FROM  featureprop fp, cvterm cv
     WHERE fp.type_id = cv.cvterm_id  
   
   UNION ALL
     SELECT fl.feature_id AS feature_id, 'location' as field, 
         chr.uniquename ||':'|| cast( fl.fmin+1 as text) ||'..'|| cast( fl.fmax as text)
         || CASE 
           WHEN fl.strand IS NULL THEN ' '
           WHEN fl.strand < 0 THEN ' [-]'
           ELSE ' [+]'
           END AS value
     FROM   featureloc fl, feature chr
     WHERE  fl.srcfeature_id = chr.feature_id 
   
   UNION ALL
     SELECT af.feature_id AS feature_id,   
      'an:' ||  
      CASE 
         WHEN a.name IS NOT NULL THEN a.name
         WHEN a.sourcename IS NOT NULL THEN (a.program || '.' || a.sourcename)::text
         ELSE a.program
       END  AS field,
       CASE  
         WHEN af.rawscore IS NOT NULL THEN cast(af.rawscore as text)
         WHEN af.normscore  IS NOT NULL  THEN cast(af.normscore  as text)
         WHEN af.significance  IS NOT NULL THEN cast(af.significance as text)
         ELSE cast(af.identity  as text)
       END  AS value 
     FROM   analysisfeature af, analysis a
     WHERE  af.analysis_id = a.analysis_id 
 ;

simple gene_page output

 dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = 'PAU1';
  feature_id |   field    |          value           
 ------------+------------+--------------------------
          23 | Name       | PAU1
          23 | uniquename | PAU1
          23 | seqlen     | 
          23 | type       | gene
          23 | organism   | S.cerevisiae
          23 | Synonym    | PAU1
          23 | Dbxref     | GeneID:853232
          23 | Dbxref 2   | GFF_source:GenBank
          23 | Dbxref 2   | GeneID:853232
          23 | gene       | PAU1
          23 | locus_tag  | YJL223C
          23 | location   | NC_001142:8776..9138 [-]

The above data was loaded from Yeast GenBank Genome (i.e. not very complex)

longer gene_page output

See this Sample_Chado_gene_report for a well studied gene from FlyBase chado release 5.

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Authors

Categories:

Documentation

Community

Tools