Design notes (JAR)

See also: Issue

Here are some of my thoughts on the technical design of a system that might meet our requirements.

The problem I worry about most is what I call "survivability" - reducing the cost and overhead of implementation and maintenance, in order that the system be able to survive changes in ownership, hosting, funding, availability, etc. I figure that if the system is simple enough, it can be rehosted or even reimplemented so easily that anyone relying on it could do so, if they wanted to.

Id-space registry

By "id-space" I mean an identifier namespace in a publicly accessible life sciences database or repository, of the sort named in the LSRN registry. Examples: PDB, Genbank.

A registry of id-spaces known to the shared names system will be maintained and published somewhere. The registry will have for each id-space at least a canonical short name, and a prose description. (Better: primary URI for the id-space, and other information along the lines of what LSRN keeps. Better: All this information available in RDF.)

The registry should be versioned and can be updated by anyone authorized by the Steering committee. Update should be possible using a friendly interface such as a wiki, even if the underlying information is maintained in RDF or in a triple store. Maintenance could be centralized, with slave copies propagated from a master via rsync, or distributed, using something like Git. See comments on replication below.

There is an issue around hierarchical id-spaces - some information sources have multiple identifier spaces. We will have to figure out whether and how the naming of the id-space reflects the organization of id-spaces into sources.

URIs for records in id-spaces

Note: I use the word "record" loosely to mean a bundle of information keyed by a record id that is local to the id-space - very roughly speaking, what you see when you visit the databank and ask for information by id. This does not imply a "database record" in the sense that one might use it in a DBMS, but rather the aggregate of information specific to the particular key. If the databank is hosted by a DBMS then this information may be drawn from several tables. From the point of view of the shared names system, the internals of a record are opaque.

A simple function RWC(id-space short name, record id), chosen before deployment and fixed thereafter, maps to http: URIs (shared names) via simple template-filling, e.g. RWC("pdb", "2vb5") => the URI naming record 2vb5 in PDB. "RWC" stands for "record without commitment": These URIs are meant to name (in the RDF sense) the record without commitment to encoding (XML, ASN, etc.) or to time (that is, the record may change over time, and the RWC URI does not commit to a version from any particular time).

We have talked about a variety of forms for these URIs. To some extent the choice is arbitrary. Here are some that have been put forth:

http://sharedname.org/record/pdb/2vb5
http://sharedname.org/pdb/record/2vb5
http://sharedname.org/pdb/2vb5
http://sharedname.org/pdb:2vb5
http://pdb.sharedname.org/2vb5
http://sharednames.net/record/pdb/2vb5
...

The insertion of "/record/" serves to keep the URI space tidy. For example, there are already URIs of the form http://sharedname.org/page/XXX, and it would be unfortunate if any application thought that 'page' were an id-space name.
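
To make the template-filling concrete, here is a minimal sketch in Python. It assumes the first form listed above (http://sharedname.org/record/<id-space>/<record id>), which is only one of the candidates. The function name rwc, the constant RWC_TEMPLATE, and the decision to percent-encode both components are my own assumptions for illustration, not settled design.

```python
from urllib.parse import quote

# Assumed layout: the actual URI form has not been chosen yet.
RWC_TEMPLATE = "http://sharedname.org/record/{id_space}/{record_id}"


def rwc(id_space: str, record_id: str) -> str:
    """Return the RWC URI (shared name) for a record by filling the template."""
    # Percent-encode both parts so that characters such as "/" or "," inside
    # a record id cannot be confused with URI structure (an assumption).
    return RWC_TEMPLATE.format(
        id_space=quote(id_space, safe=""),
        record_id=quote(record_id, safe=""),
    )


print(rwc("pdb", "2vb5"))
```

Because the function is fixed and purely syntactic, any client can compute a shared name offline, without contacting the shared names servers.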

Other URIs

A second fixed function ABOUT(id-space short name, record id) maps to a URI that is meant to name an RDF page (document) that is "about" the record RWC and other closely related entities (see below). I'll call such a page or document an "about-document".

I haven't settled on a favorite rule for forming about-document URIs, but here are some ideas: for http://sharedname.org/path,

 http://about.sharedname.org/path
 http://sharedname.org/about/path
 http://sharedname.org/path,about

This rule works for other kinds of URIs, not just RWC URIs.
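
The three candidate rules above can each be written as a one-step rewrite of the original URI. The following Python sketch shows all three side by side; the function names are hypothetical, and none of the rules has been chosen.

```python
from urllib.parse import urlsplit, urlunsplit


def about_by_host(uri: str) -> str:
    """http://sharedname.org/path -> http://about.sharedname.org/path"""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(netloc="about." + parts.netloc))


def about_by_prefix(uri: str) -> str:
    """http://sharedname.org/path -> http://sharedname.org/about/path"""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path="/about" + parts.path))


def about_by_suffix(uri: str) -> str:
    """http://sharedname.org/path -> http://sharedname.org/path,about"""
    return uri + ",about"


example = "http://sharedname.org/record/pdb/2vb5"
for rule in (about_by_host, about_by_prefix, about_by_suffix):
    print(rule(example))
```

Only the first rule changes the host name, which matters if the about-documents are ever to be served from a different machine than the RWC URIs.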

There will also be other functions XML(...), HTML(...) mapping to URIs for particular encodings or presentations of the record. The exact set of encodings and presentations available, and their precise forms, is of course specific to each databank. We have talked about a variety of forms for these URIs. Here are some that have been put forth:

 http://sharedname.org/xml/pdb/2vb5
 http://sharedname.org/pdb/xml/2vb5
 http://sharedname.org/pdb/2vb5.xml

Clearly this choice needs to be coordinated with the choice for the RWC and about-document URIs, so that a consistent form is presented.

We may choose to define URIs that apply to particular versions of databanks, or for particular versions of records within databanks.

Doing a GET on a shared name

The system is designed primarily for coreference, not reference, so applications are not encouraged to make use of services based on HTTP access to URIs belonging to the project. Nevertheless, if the URIs are successful, agents will want to look them up from time to time, for a variety of purposes.

HTTP GET serves two purposes in this system. One is for automated agents doing semantic web discovery; the goal there is to obtain some RDF (the about-document), and this is done by finding the URI of the about-document. The other is for the benefit of humans stumbling on shared URIs. If someone doesn't know what is named, they can put the URI into a browser and at least get a clue. This is not necessarily meant to be a primary user interface to the databank record, but it can be used whenever a URI is found out of context.

When HTTP GET is heavily used in applications, a local proxy server should be set up to provide the needed services, and applications should be configured to use it.

Discovering the about-document URI from the RWC URI

Clients that can tell that they're dealing with a shared names URI needn't do any network access to determine the about-document URI; they can just apply a static rule.

Although the standards around this kind of thing are rather chaotic, two plausible protocols have emerged for determining the about-document URI for a thing given a URI for the thing: the 303 hack, and the LRDD protocol.

The 303 hack is hinted at by the W3C TAG's resolution on the httpRange-14 issue, and has been adopted by Tim Berners-Lee's 'tabulator' Firefox plugin and maybe some other Semantic Web tools. This is very easy to support: a GET on an RWC URI returns a 303 response whose Location: header specifies the about-document URI.

The proposed LRDD protocol gives three different methods by which a server can communicate an "about" URI to a client: Link: header, <link> element, and Link-pattern: provided by a host-meta document. It is very easy to support both Link: and host-meta, so we should do so.

Link: and 303 are compatible, as a single response may have a Link: header and a Location: header both giving the same about-document URI.
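
As a rough illustration of a combined response, the Python sketch below builds the status code and headers for a GET on an RWC URI. It assumes the ",about" suffix rule for forming the about-document URI and assumes "describedby" as the link relation; both are my assumptions, and the function name is hypothetical.

```python
def rwc_get_response(rwc_uri: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a GET on an RWC URI."""
    # Assumed rule: the about-document URI is the RWC URI plus ",about".
    about_uri = rwc_uri + ",about"
    headers = {
        # Read by clients that follow the 303 convention.
        "Location": about_uri,
        # Read by LRDD-style clients; the rel value is an assumption.
        "Link": f'<{about_uri}>; rel="describedby"',
    }
    return 303, headers


status, headers = rwc_get_response("http://sharedname.org/record/pdb/2vb5")
print(status)
for name, value in headers.items():
    print(f"{name}: {value}")
```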

We might consider use of a 307 response possessing a Link: header (useable by LRDD); see below.

Note that in the 2007 Common Naming prototype, the response is not a 303 or 307, but rather a 302 that designates a second URI that in turn does a 303. This is because the URIs are based at purl.org and the OCLC PURL service doesn't provide the ability to do a 303.

The about-document

As the URI of the about-document names a document (perhaps one that is improved over time as the system develops), a GET will deliver some relevant RDF/XML via a 200 response.

(In the 2007 Common Naming prototype, a GET is handled by the OCLC purl server, so it returns a 302 redirect to a second URI handled by a Neurocommons server. If the URI resolves to an adequately equipped server in the first place, then no 302 is necessary.)

What is in the about-document:

  • rdf:type of the record
  • an rdfs:label property
  • a property for the record id, as a string
  • a property indicating which id-space the id belongs to (perhaps this is rdf:type) (careful, what if same record is in multiple id-spaces?)
  • links to HTML, XML, ASN, etc. encodings provided by the databank
  • links to third-party metadata, encodings (e.g. translations to RDF), or other resources (e.g. iHOP pages?)

By "link" I mean an RDF statement whose subject is the record (named by its RWC URI), whose object is a document of some kind (named by some URI), and whose verb is specific to the relationship between the two. The verb is drawn from some ontology that shared names manages.

Note that the about-document does not in general contain any content obtained from the record itself. Although this might be possible in principle, it goes against our hands-off approach and the idea of separation of concerns (naming system vs. content provision). It is also frequently impossible due to licensing issues.

An XSLT transformation creates an HTML version of the RDF for human consumption. This avoids any temptation to do content negotiation, which would unnecessarily make the meaning of the about-document URI more confusing.

Implementing GET of RWC URI

The rule that maps the RWC URI to the about-document URI is uniform across all id-spaces and records, so it could be implemented as, say, a simple RedirectMatch rule in Apache. (Apache may be just one of many ways in which the system is supported.) For example, suppose the RWC URIs have paths beginning /record/, and the about-document URI consists simply of the RWC URI with the string ",about" appended. The following Apache directive would implement 303 redirection:

RedirectMatch 303 ^(/record/[^,]*)$ http://sharedname.org$1,about

This could also be done using a single partial redirect on a PURL server, assuming use of a PURL code base that supports 303 redirects. (The approach would have to be modified under other URI syntax designs, such as ones with /record/ in the middle or end or absent.)
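
To see what the RedirectMatch pattern above does and does not match, here is the same regular expression exercised in Python. This only mirrors the pattern for illustration; it is not how Apache evaluates the directive internally, and the function name is hypothetical.

```python
import re

# Same pattern as in the RedirectMatch directive: a path under /record/
# that contains no comma anywhere after the prefix.
RULE = re.compile(r"^(/record/[^,]*)$")


def redirect_target(path: str):
    """Return the 303 target for a request path, or None if the rule does not apply."""
    match = RULE.match(path)
    if match is None:
        return None
    return "http://sharedname.org" + match.group(1) + ",about"


print(redirect_target("/record/pdb/2vb5"))
print(redirect_target("/record/pdb/2vb5,about"))
print(redirect_target("/page/XXX"))
```

The exclusion of commas is what keeps the about-document URIs themselves from being redirected a second time.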

(TBD: Figure out whether 303 + Link: is as easy as this. Apache Header directive)

Other possibilities include 307 with Link: and even 200 with Link: (look at the clever thing that LSRN does, e.g. http://lsrn.org/PDB:2vb5). Although this would eliminate the possibility of a 303 response communicating the about-document URI, the response's Location: could direct browsers to the databank's user-friendly page for the record, and some people I've spoken with think this is a good idea. (Compare .) I do not recommend a 307 to the native site as the user would be deprived of the additional information supplied by the about-document. And any 200 response, with or without an intervening 307, may mislead someone into thinking that the URI names the document visited as opposed to the record itself. (I'm not sure I believe this, but it's the view espoused by the W3C TAG in its httpRange-14 resolution, and is a foundation of the linked-data architecture.)

This simple single-rule approach has no error detection. A first improvement would be to respond with 404 when an invalid id-space name is used. This could be done either using a more complex configuration (naming each id-space explicitly in its own Apache or PURL rule), or using a script that consults a list kept somewhere (a file, triple store, etc.). A second improvement would be to respond with 404 when there is no record possessing the given record id. This is harder to implement, as it requires synchronization with the databanks or possibly some other mechanism. This issue is under discussion.
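
The first improvement (404 for an unknown id-space) might look like the following Python sketch. It assumes paths of the form /record/<id-space>/<record id> and the ",about" suffix rule. The set of known id-spaces is a stand-in for whatever list the registry provides, and the short name "genbank" is assumed.

```python
import re

# Stand-in for the id-space registry; in practice this list would be read
# from a file, a triple store, etc.
KNOWN_ID_SPACES = {"pdb", "genbank"}

# Assumed path layout: /record/<id-space>/<record id>, with no commas.
RWC_PATH = re.compile(r"^/record/([^/,]+)/([^,]+)$")


def resolve(path: str, known=KNOWN_ID_SPACES):
    """Return (status, headers) for a GET on an RWC path."""
    match = RWC_PATH.match(path)
    if match is None or match.group(1) not in known:
        # Malformed path or unregistered id-space.
        return 404, {}
    return 303, {"Location": "http://sharedname.org" + path + ",about"}


print(resolve("/record/pdb/2vb5"))
print(resolve("/record/nosuch/123"))
```

The second improvement (404 for a nonexistent record id) is not covered here, since it depends on information held by the databanks.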

Implementing GET of about-document URI

The about-document implementation is more involved and more open-ended in terms of supported features. The simplest form of a script is as follows:

  1. Extract id-space name and record id from the URI
  2. Look up id-space name in a table of RDF templates (part of the id-space registry)
  3. Fill in record id "blanks" in the template with the actual record id
  4. Deliver resulting RDF in a 200 response
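
The four steps above can be sketched in Python as follows. The URI layout (/record/<id-space>/<record id>,about), the template text, and all names are assumptions made for illustration; a real template would carry the properties and links listed under "The about-document".

```python
import re
from string import Template
from xml.sax.saxutils import escape

# Assumed about-document path layout.
ABOUT_PATH = re.compile(r"^/record/([^/,]+)/([^,]+),about$")

# Hypothetical template table, one RDF/XML template per id-space.
TEMPLATES = {
    "pdb": Template(
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\n'
        '  <rdf:Description rdf:about="http://sharedname.org/record/pdb/$record_id">\n'
        '    <rdfs:label>PDB record $record_id</rdfs:label>\n'
        '  </rdf:Description>\n'
        '</rdf:RDF>\n'
    ),
}


def about_document(path: str):
    """Return (status, body) for a GET on an about-document path."""
    match = ABOUT_PATH.match(path)           # 1. extract id-space and record id
    if match is None:
        return 404, ""
    id_space, record_id = match.groups()
    template = TEMPLATES.get(id_space)       # 2. look up the template
    if template is None:
        return 404, ""
    safe_id = escape(record_id, {'"': "&quot;"})
    body = template.substitute(record_id=safe_id)  # 3. fill in the blanks
    return 200, body                         # 4. deliver the RDF


status, body = about_document("/record/pdb/2vb5,about")
print(status)
print(body)
```

The record id is XML-escaped before substitution so that an unusual id cannot produce malformed RDF/XML.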

Ways to keep track of such a table:

  • In local file system
  • In a triple store
  • Via a behind-the-scenes GET (for Neurocommons, the template is fetched in real time from a wiki)

It may be desirable to limit the insertion of third-party metadata links to those cases when such metadata exists (e.g. an errata database for Pubmed that only has entries for a small fraction of all Pubmed records). This presents an implementation challenge.

The same error detection (404) considerations apply as for the RWC URIs.

Other GETs

For shared-name URIs for encodings, the server can respond with a 307 redirecting to the appropriate location on the databank's home site. This is easily done either in Apache or in a PURL or PURLZ server.
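
A sketch of such a redirect in Python, assuming the /xml/pdb/2vb5 form for encoding URIs. The target URLs use a placeholder host (databank.example.org) and are entirely hypothetical; real targets would come from each databank's own URL scheme, recorded in the registry.

```python
import re

# Hypothetical table: (id-space, encoding) -> URL template on the databank's
# home site. The host and paths are placeholders, not real databank URLs.
ENCODING_TARGETS = {
    ("pdb", "xml"): "http://databank.example.org/pdb/{record_id}.xml",
    ("pdb", "html"): "http://databank.example.org/pdb/{record_id}.html",
}

# Assumed path layout: /<encoding>/<id-space>/<record id>
ENCODING_PATH = re.compile(r"^/(xml|html)/([^/]+)/([^/]+)$")


def encoding_redirect(path: str):
    """Return (status, headers) for a GET on an encoding-specific shared name."""
    match = ENCODING_PATH.match(path)
    if match is None:
        return 404, {}
    encoding, id_space, record_id = match.groups()
    target = ENCODING_TARGETS.get((id_space, encoding))
    if target is None:
        return 404, {}
    return 307, {"Location": target.format(record_id=record_id)}


print(encoding_redirect("/xml/pdb/2vb5"))
```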

(At the time of this writing, the PURL code for partial redirects can only substitute a PURL prefix for some fixed other prefix; suffixes cannot be either recognized or added. So if we wanted the encoding-specific URIs to end in, say, .xml or .html, then the use of the PURL code base would demand the addition of this feature to the code base. As Apache is regular-expression based, it has no such restriction.)

Optional: for LRDD, a Link: header in the response could point either to the same about-document as the one for the RWC (assuming we consider whatever description it carries of the encoding to be adequate), or to a separate about-document that is specific to the encoding.

Replication

Operationally the system consists of two services, the 303 service for the RWC URIs and the 200 service for the about-document URIs. These may be hosted on the same server or on different servers. (It is not clear whether this is useful, but obviously to allow for different servers for the two services, the URIs would have to have different host name components, e.g. sharedname.org vs. about.sharedname.org.)

As discussed, each service should be replicated both physically and administratively for the sake of availability. The state required to provide the services (the registry and templates, and perhaps other kinds of information) might be stored centrally and pushed or pulled out to slave servers; or a more distributed information flow could be established. Information could be propagated using existing tools such as rsync, Mercurial, or Git.

If the required state is shepherded somehow by a PURL or handle server (something I have not thought much about), then those systems would have to support master/slave state propagation. I believe the handle system already has native support for replication, while Benjamin Dai is looking into how to support this for PURLs.

Maintenance

The id-space registry and about-document templates have to be kept up to date, with errors corrected and new entries added when needed. Databanks move around and change their internal organization. This of course requires human intervention, and is a very fragile part of the proposition. The key is to involve those who care about using the system, and to empower them to perform needed maintenance.

Most routine operations should be doable via simple web forms and/or wiki-like interfaces. For example, partial redirects that are implemented by a PURL server can be added and modified using the web application built into the PURL code.

We can set up one or more Nagios clients to monitor correct operation of the deployed servers.

Nice to have

Probably the id-space registry itself should be an id-space. Then each entry in the id-space registry will have its own shared name.

A third-party errata and metadata service - one that would maintain information related to particular records, independent of the record's publisher - would be a good thing. In principle the shared names project could play a role in setting this up, in coordination with the rest of its infrastructure. Point for discussion.

Then comes the hard part

This document is meant to help demonstrate the feasibility and simplicity of the core of the system. There are other issues that are more difficult than the basic services. If we can understand that basic services are essentially easy, then our minds will be freed up to work on the harder questions. These include:

  • Ontology - we will need classes and relations between the various information entities (id-space, RWC, encoding), and those will have to be designed, hosted, and maintained somehow
  • Design for maintenance - making maximum use of supported externally supplied software, to reduce deployment and maintenance costs, minimize the project's software base, and offload risk
  • User-friendliness - future maintainers of the system should be able to make changes to id-space templates, encoding redirects, and other aspects of the configuration without having to know a lot (their experience and skill sets will vary wildly)