URI Format Considerations

Choosing a URI Format

TBD: [ alanr added 2010-05-03 ]

  1. URI form for dated records
  2. Naming of derivative resources, e.g. an RDF rendering of an Entrez Gene record where the RDF does not originate from NCBI, see Issue/Derivative resources
  3. URI form for contributed annotations - errata

This page is extracted from Steering Committee conversations.

The question is, for a given record identifier, id-space, and 'encoding' (either 'without commitment to particular encoding' or particular encodings), what should the shared names URI look like? In particular:

  1. Are id-spaces designated by numbers or by strings? (See Issue/Id-space names or identifiers)
  2. Are id-spaces encoded in the domain name or the path?
  3. Does the URI have /n/ at the beginning of the path? (See Issue/OCLC noise tradeoff)
  4. Is the encoding given as part of the path (/r/, /html/, etc.) or as a suffix (none, .html, .xml, etc)?
  5. If the id-space is in the path, is it separated from the record designator using / or : ?
  6. Do we use / URIs or # URIs?
  7. Do we support expression of shared names URIs as XML Qnames?

Proposal

Basically, the form is http://sharedname.org/idspace/record . Examples:

http://sharedname.org/ncbi_gene/7157
http://sharedname.org/ncbi_gene/7157.xml
http://sharedname.org/ncbi_gene/7157.html
http://sharedname.org/pubmed/20157591

For full discussion of issues see below.

The about-record itself, i.e. the 303 target, could have its own predictable URI: http://sharedname.org/ncbi_gene/7157.about
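The proposed form can be sketched as a small helper. This is only an illustration of the pattern above, not an agreed implementation; the function name shared_name and its parameters are made up for this example.

```python
BASE = "http://sharedname.org"  # domain name proposed on this page


def shared_name(idspace, record, encoding=None):
    """Build a URI of the form BASE/idspace/record[.encoding].

    encoding=None designates the record without commitment to an
    encoding; otherwise the encoding is appended as a dot-suffix
    (for example "xml", "html", or "about").
    """
    uri = f"{BASE}/{idspace}/{record}"
    if encoding:
        uri += f".{encoding}"
    return uri


print(shared_name("ncbi_gene", 7157))           # http://sharedname.org/ncbi_gene/7157
print(shared_name("ncbi_gene", 7157, "xml"))    # http://sharedname.org/ncbi_gene/7157.xml
print(shared_name("ncbi_gene", 7157, "about"))  # http://sharedname.org/ncbi_gene/7157.about
```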

Suggestion from Issue/Form of URIs: put the idspace metadata record at http://sharedname.org/ncbi_gene . (Alternative: set up a meta-idspace whose 'records' describe idspaces.)

URIs for versions of records and of idspaces need some thought. A version (build, copy, whatever) may be designated either by a string serving as a 'version number' or 'build number', or by a date, of varying precision (YYYY vs. YYYYMMDD). The following concrete proposal is given at Issue/Form of URIs:

Version named by provider at accession level: Use provider's identifier for the version.

Version named by provider "build": Use provider's build identifier in path immediately before ID. There is an about record for such versions.

Other archived version: Use date in path immediately before ID - YYYY[-MM][-DD] with assumption that imprecision implies that we don't know when in the interval the version was. There is no about record for such versions. Information about them is in the metadata for the generic "about".
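A rough sketch of how the dated form might be handled. It assumes the date is its own path segment between the idspace and the ID (one reading of "immediately before ID") and that the optional parts nest as YYYY, YYYY-MM, YYYY-MM-DD. The names version_interval and dated_uri are hypothetical.

```python
import calendar
import re
from datetime import date

BASE = "http://sharedname.org"
# YYYY, YYYY-MM, or YYYY-MM-DD (nesting of the optional parts is assumed)
DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def version_interval(stamp):
    """Return (first_day, last_day) of the interval a date stamp covers."""
    m = DATE_RE.fullmatch(stamp)
    if m is None:
        raise ValueError(f"not a YYYY[-MM][-DD] date: {stamp!r}")
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else None
    day = int(m.group(3)) if m.group(3) else None
    if month is None:
        # Year only: the version dates from some unknown day in that year.
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        # Year and month: some unknown day in that month.
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    exact = date(year, month, day)
    return exact, exact


def dated_uri(idspace, stamp, record):
    """Build BASE/idspace/date/record (the segment layout is an assumption)."""
    version_interval(stamp)  # raises ValueError on a malformed stamp
    return f"{BASE}/{idspace}/{stamp}/{record}"
```

For example, dated_uri("ncbi_gene", "2010-05", 7157) would give http://sharedname.org/ncbi_gene/2010-05/7157, and version_interval("2010-05") spans 1 May to 31 May 2010.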

Discuss at Issue/Form of URIs. TBD.

Phasing of implementation

To get started in a hurry: Phase I: set up to only 303 redirect to the Entrez web page for now [suggests Scott]. Phase II: get the about-record representation and service in place.

As with Entrez Gene, in Phase I we could initially 303 redirect PubMed records to the PubMed web page.

Apache configuration would be trivial, so this could be hosted anywhere.

We could start with just these two idspaces, and add more in Phase II.
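To make the Phase I behaviour concrete, here is a sketch of the redirect logic in Python rather than as Apache configuration. The target URL templates are assumptions about the providers' web page addresses, and the function name phase_one_response is hypothetical.

```python
# Assumed provider page templates; verify against the providers before use.
TARGETS = {
    "ncbi_gene": "http://www.ncbi.nlm.nih.gov/gene/{record}",
    "pubmed": "http://www.ncbi.nlm.nih.gov/pubmed/{record}",
}


def phase_one_response(path):
    """Map a request path like '/ncbi_gene/7157' to (status, location).

    Known idspaces get a 303 redirect to the provider's web page;
    anything else gets a 404 with no location.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in TARGETS:
        return 404, None
    idspace, record = parts
    return 303, TARGETS[idspace].format(record=record)


print(phase_one_response("/ncbi_gene/7157"))
print(phase_one_response("/pubmed/20157591"))
```

Adding an idspace in Phase II would then amount to adding an entry to the table.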

Decisions implicit in the proposal

Alphabetic vs. numeric idspace

The decision to use name strings like "ncbi_gene", as opposed to numbers as the handle/DOI system uses. The argument for using numbers is that, as organizations change name or are merged, the URIs do not end up looking stale or mentioning perhaps non-existent organizations or trademarks. Scott and Jonathan think that technically the latter (numbers) is the better choice, but guess that the community isn't ready for it.

The alternative (numeric idspace designator) would look something like this:

http://sharedname.org/12/7157

assuming '12' is the designator that encodes Entrez Gene.

See Issue/Databank_names_or_identifiers for discussion.

[JAR: This issue is unresolved.]

Idspace in path vs. in domain

The decision not to dispatch DNS off subdomains. The alternative looks like:

http://pubmed.sharedname.org/20157591

or

http://9.sharedname.org/20157591

Subdomains depend on unimplemented functionality in the PURLZ (and PURL) software (did Zepheira say they'd be adding this?) and complicate the federation arrangement, since they make it possible for some servers to serve only one idspace. There have nonetheless been advocates for this style.

See Issue/Form of URIs for discussion.

[JAR: This issue is unresolved. Need to find out PURLZ's plans and steering committee inclinations.]

No OCLC noise vs. OCLC noise

The willingness to not have the current OCLC PURL server (purl.org) in the rotation. OCLC would have to adopt the PURL federation software when it is ready in order to be in the rotation. We can't be certain that they will do that (yet).

We would need the /n/ in the URI if OCLC is in the rotation and does not adopt the new PURLZ software:

http://sharedname.org/n/pubmed/20157591

See Issue/OCLC_noise_tradeoff

[JAR: I believe we have consensus to not insist on compatibility with purl.org's current software. purl.org will not be a replicate until such time as it updates to PURLZ.]

Encoding as dot-suffixes vs. encoding as path component

The use of suffixes to choose desired content, i.e. http://sharedname.org/pubmed/20157591.xml, http://sharedname.org/pubmed/20157591.html, etc.

This is not supported by old PURL software (OCLC). To support what OCLC is currently running, we would need one of

http://sharedname.org/n/pubmed/rdf/20157591 or
http://sharedname.org/n/rdf/pubmed/20157591.

(compare with the 2007 Science Commons shared names predecessor)

[JAR: I believe we have consensus to not insist on compatibility with purl.org's current software. purl.org will not be a replicate until such time as it updates to PURLZ. So we likely have consensus on the use of dot-suffixes.]
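The dot-suffix convention discussed in this section implies that a server has to split the encoding off the last path segment. A minimal sketch, assuming a fixed set of known suffixes so that record designators that themselves contain dots are left alone; the name split_shared_name, the suffix set, and the refseq idspace in the usage lines are illustrative only.

```python
from urllib.parse import urlparse

# Suffixes mentioned on this page; a real deployment would define its own set.
KNOWN_ENCODINGS = {"xml", "html", "about"}


def split_shared_name(uri):
    """Return (idspace, record, encoding) for a shared-names URI.

    encoding is None when the last segment carries no known suffix,
    i.e. the record without commitment to an encoding.
    """
    idspace, _, last = urlparse(uri).path.lstrip("/").partition("/")
    record, dot, suffix = last.rpartition(".")
    if dot and suffix in KNOWN_ENCODINGS:
        return idspace, record, suffix
    return idspace, last, None


print(split_shared_name("http://sharedname.org/pubmed/20157591.xml"))
print(split_shared_name("http://sharedname.org/pubmed/20157591"))
# A designator with its own dot is not mistaken for an encoding suffix:
print(split_shared_name("http://sharedname.org/refseq/NM_000546.5"))
```

Matching against a closed list is one way to keep versioned accessions from being misread as encodings; the page does not say how this would actually be handled.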

Slash vs. colon as separator

The decision to use / rather than : to separate the id-space designator from the record designator. Bio2RDF and LSRN both use : . This hasn't been discussed much, but / works with the PURLZ software, and supports relative URIs.

http://sharedname.org/pubmed:20157591

[JAR: Consensus I believe.]
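The point about relative URIs can be illustrated with standard relative-reference resolution: with / as the separator, a bare record designator or a ../idspace/record reference resolves against another shared name as one would hope. A small sketch using Python's urllib.parse.urljoin (the second PubMed ID is an arbitrary illustration):

```python
from urllib.parse import urljoin

base = "http://sharedname.org/pubmed/20157591"

# A sibling record in the same idspace, written as a relative reference.
print(urljoin(base, "20157592"))
# -> http://sharedname.org/pubmed/20157592

# A record in another idspace, reached by stepping up one segment.
print(urljoin(base, "../ncbi_gene/7157"))
# -> http://sharedname.org/ncbi_gene/7157
```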

Slash vs. hash URIs

The decision to use / URIs rather than # URIs for the record-without-commitment-to-encoding. / URIs will require either 303 redirects or an as-yet-undeployed protocol using the Link: header in order to provide applications with RDF. # URIs work only if the record designator precedes the #, in which case we can't use a common prefix in SPARQL.

http://sharedname.org/pubmed/20157591#r
http://sharedname.org/pubmed/20157591#a (as in "accession")
http://sharedname.org/pubmed/20157591#_
http://sharedname.org/pubmed/20157591#

Hash URIs don't require the 303 dance, and are therefore easier both to deploy and to use.
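One way to see why the record designator has to precede the #: the fragment is not sent to the server, so the document retrieved is determined by everything before it. The sketch below uses urllib.parse.urldefrag to show which document each form would fetch; the second URI, with the designator after the #, is a hypothetical alternative and not one proposed on this page.

```python
from urllib.parse import urldefrag


def document_retrieved(uri):
    """Return (document URI actually requested, fragment kept client-side)."""
    url, fragment = urldefrag(uri)
    return url, fragment


# Designator before the #: one document per record.
print(document_retrieved("http://sharedname.org/pubmed/20157591#r"))

# Designator after the # (hypothetical): every record in the idspace
# would resolve to the same document.
print(document_retrieved("http://sharedname.org/pubmed#20157591"))
```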

[JAR: Consensus I believe, although I don't think it's been discussed explicitly.]

No NCnames vs. NCnames

The decision not to write record designators as NCnames. To support expression of URIs as XML (and N3) Qnames, we would need for the URI to end with an NCname, and an NCname has to begin with a letter or _. Prefixing a numeric record designator with a letter is likely unsupportable in the PURLZ software (should we check with Zepheira?). And it would fail, or become complicated or unsupportable, with id-spaces whose record designators include non-NCname characters (such as ...?).

http://sharedname.org/pubmed/r20157591
http://sharedname.org/pubmed/a20157591
http://sharedname.org/pubmed/_20157591
http://sharedname.org/pubmed/PM20157591

N.b. this consideration does not apply to SPARQL or RDFa.
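The NCname constraint can be checked mechanically. The sketch below uses an ASCII-only approximation of the NCName production (the real production also admits many non-ASCII characters), so treat it as an illustration rather than a validator; the name is_ncname is made up here.

```python
import re

# ASCII-only approximation: starts with a letter or underscore, then
# letters, digits, underscore, hyphen, or dot. No colon is allowed.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ncname(local_part):
    """True if local_part could serve as the local part of a Qname."""
    return NCNAME_ASCII.fullmatch(local_part) is not None


for designator in ("20157591", "r20157591", "_20157591", "PM20157591"):
    print(designator, is_ncname(designator))
```

A purely numeric designator such as 20157591 fails the check, while each of the prefixed variants listed above passes.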

[JAR: Consensus I believe, although I don't think it's been discussed explicitly.]

sharedname.org vs. something else

The decision to use the domain name 'sharedname.org' (not sharednames.org or sharedname.net; also not lsrn.org). [JAR: consensus I believe]

http: vs. something else

The decision to use http: URIs as opposed to info:, urn:, or some other kind of URI, or some non-URI syntax [JAR: I believe the steering committee is in agreement with this, but I can't find documentation supporting this. Implicit here maybe.]

Demo summary

[JAR: I think this section can be deleted]

This section discusses the URIs used in the demo, as a point of comparison. (It is not proposing we continue with those URIs.) See also Issue/Form of URIs. As an example we take Entrez Gene record 7157.

For Entrez, if we followed the scheme introduced in the demo, the URI for the record would be

http://sharedname.org/n/r/ncbi_gene/7157

The "n" is there to give a purl domain name that can be served by the OCLC purl server if we want. The "r" denotes that we are talking about the record without commitment to representation. If we are confident that the purl server will be able to dispatch off suffixes, then we could omit the "r" and have it be:

http://sharedname.org/n/ncbi_gene/7157

The "n" in the above URI is not necessary as a placeholder because Zepheira has agreed to add virtual hosting to the PURL software.