Issue/Provider partitions
Discussion
email:A potential source of instability
In the common schemes (LSRN, NCBI, etc.), a number of what one might consider single databases are given multiple identifier spaces.
Taking a couple of examples from LSRN:
ATCC
- ATCC: http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?Template=cellBiology&ATCCNum=__ID__
- ATCC_dna: http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?Template=bioproducts&ATCCNum=__ID__
COG
- COG_Cluster: http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi?__ID__
- COG_Function: http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi?fun=__ID__
- COG_Pathway: not sure if it still exists
(Note that the current templates from LSRN, NCBI, and Freebase, at least, are broken for both of these sets. The URLs given above are correct.)
Splitting each provider's records into several identifier spaces makes redirection possible with a template that takes only the identifier as its argument. Note the variation in the ATCC URLs: "cellBiology" versus "bioproducts".
The reason this leads to instability is that we don't know whether the provider will change the classification and tags used in its URLs. If they simplify by merging categories, we are fine: for example, if ATCC dropped the Template parameter, we would redirect both ATCC and ATCC_dna to http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?ATCCNum=__ID__
In the other direction, however, we can run into trouble. Suppose they split bioproducts into dna and cdna, and the difference wasn't apparent from the accession. Then we wouldn't know how to redirect a given identifier without moving to a more complex strategy for deciding redirection.
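A minimal sketch of template-per-space redirection, to make the mechanism concrete. The templates are the ones listed above; the names TEMPLATES and redirect_url, and the sample identifier in the usage note, are illustrative assumptions and not part of any existing implementation.

```python
from urllib.parse import quote

# One URL template per identifier space (templates copied from the examples above).
# The dictionary and function names are hypothetical.
TEMPLATES = {
    "ATCC": "http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?Template=cellBiology&ATCCNum=__ID__",
    "ATCC_dna": "http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?Template=bioproducts&ATCCNum=__ID__",
    "COG_Cluster": "http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi?__ID__",
    "COG_Function": "http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi?fun=__ID__",
}


def redirect_url(space, identifier):
    """Build the provider URL from the identifier alone, given its space."""
    template = TEMPLATES[space]  # raises KeyError for an unknown space
    # Percent-encode the identifier so it cannot alter the query string.
    return template.replace("__ID__", quote(identifier, safe=""))
```

For instance, redirect_url("COG_Function", "J") would substitute "J" (a made-up identifier) into the COG_Function template. The point is that the space name carries the information (cellBiology versus bioproducts) that the identifier itself does not.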
Therefore we need to figure out what to do about this. Options I see:
1. Use the current classification, but commit to moving to a model in which a redirection is stored in our database for every accession, instead of a template, should the provider URLs change to something incompatible with simple redirection.
2. Contact the providers and see whether they are willing to provide a single URL pattern for accessing any record they provide, assuming identifier spaces don't overlap.
3. Move immediately to supporting per-accession redirects, rather than a template per space.
In all likelihood we will have to support option 1, since in the worst case that's the only remedy, short of having clients change their URLs, which I consider to be out of the question.
It may or may not be worth pursuing option 2; I don't know.
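As a rough sketch of what a fallback to per-accession redirection might look like: a stored per-accession entry wins when one exists, and the per-space template is used otherwise. Everything here is an assumption for illustration: the names SPACE_TEMPLATES, ACCESSION_OVERRIDES, and resolve, the accessions EXAMPLE-1 and EXAMPLE-2, and the example.org URL, which is a placeholder rather than a real provider address.

```python
from urllib.parse import quote

# Per-space template, as used today (taken from the ATCC example above).
SPACE_TEMPLATES = {
    "ATCC_dna": "http://www.atcc.org/ATCCAdvancedCatalogSearch/ProductDetails/tabid/452/Default.aspx?Template=bioproducts&ATCCNum=__ID__",
}

# Hypothetical per-accession table, standing in for rows stored in a database.
# It is only needed for accessions the template can no longer resolve.
ACCESSION_OVERRIDES = {
    ("ATCC_dna", "EXAMPLE-1"): "http://example.org/cdna?ATCCNum=EXAMPLE-1",  # placeholder URL
}


def resolve(space, accession):
    """Return the redirect target, or None if the space is unknown."""
    override = ACCESSION_OVERRIDES.get((space, accession))
    if override is not None:
        return override  # a stored per-accession redirect takes precedence
    template = SPACE_TEMPLATES.get(space)
    if template is None:
        return None
    return template.replace("__ID__", quote(accession, safe=""))
```

With this arrangement client URLs stay the same whichever path is taken; only the server-side lookup changes if a provider splits a category.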
Note: Recently found: http://www.atcc.org/SearchCatalogs/Linkin?id=__ID__ which works for all ids!
Examples
NCBI
- www.ncbi.nlm.nih.gov: CCDS, taxon, dbSTS, dbSNP, dbProbe, dbEST, dbCloneLib, dbClone, UniSTS, PID, MIM, LocusID, GeneID, COG, CDD, AceView_WormGenes
- www.informatics.jax.org: MGI, MGD
- www.uniprot.org: UniProt_TrEMBL, UniProt_SwissProt
- imgt.cines.fr: IMGT_LIGM, IMGT_GENEDB
LSRN
- stke.sciencemag.org: STKECM_CMC, STKECM_CMP
- www.candidagenome.org: CGD_REF, CGD_LOCUS, CGD
- pubchem.ncbi.nlm.nih.gov: PubChem_Compound, PubChem_Substance
- www.ncbi.nlm.nih.gov: UniGene, GeneID, OMIM, NCBI_GP, COG_Function, COG_Pathway, INSD, AceView_WormGenes, PMID, COG_Cluster
- www.h-invdb.jbic.or.jp: H-invDB_cDNA, H-invDB_locus
- antirrhinum.net: DragonDB_DNA, DragonDB_Protein, DragonDB_Locus, DragonDB_Allele
- www.genedb.org: GeneDB_Lmajor, GeneDB, GeneDB_Tbrucei, GeneDB_Spombe, GeneDB_Pfalciparum, GeneDB_Gmorsitans
- www.ebi.ac.uk: CHEBI, UniParc, BIOMD, INSD, IntAct, InterPro
- umbbd.ahc.umn.edu: UM-BBD_pathwayID, UM-BBD_enzymeID
- pir.georgetown.edu: PIR, PIRSF
- www.genome.ad.jp: KEGG_DRUG, KEGG_PATHWAY, LIGAND, EC, KEGG_COMPOUND
- www.informatics.jax.org: MGD, MGI
- www.jbirc.aist.go.jp: H-invDB_cDNA, H-invDB_locus
- dictybase.org: DDB_gene_name, DDB_REF, DDB
- www.gramene.org: GR_protein, GR_REF, GR_MUT, GR
- www.maizegdb.org: MaizeGDB_Locus, MaizeGDB
- www.wormbase.org: WB_REF, WP, WB
- db.yeastgenome.org: SGD_LOCUS, SGD, SGD_REF
- db.ciliate.org: TGD_REF, TGD_LOCUS
- www.gene.ucl.ac.uk: HGNC, HGNC_Symbol
- mips.gsf.de: MIPS_funcat, AGI_LocusCode
- www.tigr.org: TIGR_EGAD, TIGR_GenProp, TIGR_TGI, TIGR_Tba1, TIGR_Pfa1, AGI_LocusCode, TIGR_Ath1, TIGR_TIGRFAMS, TIGR_CMR
- www.atcc.org: ATCC, ATCC_dna
- www.ebi.uniprot.org: TrEMBL, UniProt
- www.expasy.ch: Prosite, ENZYME
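One way a listing like the above could be derived, assuming the scheme's templates are available as a mapping from identifier space to URL template, is to group the spaces by the host in each template and keep the hosts that serve more than one space. This is a sketch only; the function name spaces_by_host and the input shape are assumptions, not a description of how the lists above were actually produced.

```python
from collections import defaultdict
from urllib.parse import urlparse


def spaces_by_host(templates):
    """Map each host to its identifier spaces, keeping hosts with more than one."""
    groups = defaultdict(list)
    for space, template in templates.items():
        # netloc is the host part of the URL, e.g. "www.atcc.org"
        groups[urlparse(template).netloc].append(space)
    return {
        host: sorted(spaces)
        for host, spaces in groups.items()
        if len(spaces) > 1
    }
```

Called on a mapping containing the ATCC and ATCC_dna templates, this would report both spaces under www.atcc.org, which flags the provider as a candidate for the instability discussed above.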
