Science Commons, OBO Foundry, and several other groups working with biological information on the Semantic Web have initiated an effort to establish shared community-maintained URIs for important data integration points such as life sciences database records (example: the record in the Entrez Gene database that has GeneID 1003064). We are attempting to determine the right technical and organizational recipe that will lead to the uptake of these URIs by as many RDF-using bioinformatics projects as possible.
Were this to succeed, one would be able to "mash up" information from multiple sources in a wide variety of ways, using the shared URIs to forge links between pieces. Information could be extracted from one source for inclusion in another; information could be combined in SPARQL queries using multiple FROM clauses; extractions from several triple stores could be combined to create a new one. Without shared URIs, this kind of integration would still be possible, but it would be significantly more difficult, as it would require some kind of plumbing or mapping to get the differently named data sets to link up.
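As a rough illustration of why shared URIs make this easy, the sketch below merges triples from two sources that happen to use the same URI for the same record. Everything in it is an assumption made for the example: the example.org domain, the property names, and the triples themselves are placeholders, not data from any real project.

```python
# Hypothetical shared URI for a database record; the domain is a placeholder.
GENE = "http://example.org/record/entrezgene/1003064"

# Triples "extracted" from two independent sources. The property names and
# objects are invented for illustration only.
source_a = [
    (GENE, "http://example.org/vocab/a#annotatedWith", "http://example.org/term/T1"),
]
source_b = [
    (GENE, "http://example.org/vocab/b#discussedIn", "http://example.org/doc/D1"),
]


def merge(*sources):
    """Union several triple collections into one set, dropping duplicates."""
    merged = set()
    for triples in sources:
        merged.update(triples)
    return merged


def describe(triples, subject):
    """Return every (predicate, object) pair recorded for one subject."""
    return sorted((p, o) for s, p, o in triples if s == subject)


combined = merge(source_a, source_b)
facts = describe(combined, GENE)  # statements from both sources, no mapping step
```

Because both sources name the record identically, the merge is a plain set union. Had each source coined its own URI, a mapping table between the two names would be needed before `describe` could find both statements.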
We are limiting our concern for now to URIs for records in databases mentioned in external links from the Gene Ontology (GO), such as those from Enzyme or Pfam, with possible extension to other databases such as those in the LSRN (Life Science Resource Name) registry.
The URIs we propose are meant to serve the community of Semantic Web-enabled bioinformatics projects that need URIs that refer to or denote database records. We identify the following requirements for URIs to serve a community and to be shared among Semantic Web projects:
- It must be clearly stated what the intended referent of each URI is supposed to be, i.e. that the URI denotes some particular record from some particular database.
- Information about the URI and its referent, including such a statement, must be made available, and in order to leverage existing protocol stacks, it must be obtainable via HTTP. (We'll call such information "URI documentation".)
- URI documentation must be provided in RDF.
- Provision of URI documentation must be an ongoing concern. The ability to provide it may have to outlive the original database or the database's creator.
- The provider of the URI documentation must be responsive to community needs, such as the need to have mistakes fixed in a timely manner.
- URI documentation must be open so that it can be replicated and reused.
Control of shared URIs should be in the hands of those who depend on them. This is the best way to ensure that the URIs serve the community in the ways listed above.
The idea is to manage a server, or a set of servers, that can deliver appropriate RDF documents for the shared URIs. An HTTP GET of a URI would retrieve not the database record itself, but rather one of these RDF documents (perhaps via a 303 redirect). The RDF document will include the following:
- Documentation specifying what the URI denotes (URI documentation), including an rdf:type, the database that the record comes from, and the record's identifier or key within that database.
- Links to the various encodings of the record offered by the data provider (e.g. XML, ASN.1, and HTML for the NCBI databases). Each of these encodings would in turn have its own URI naming that particular encoding of the record, the main URI being the name of the record "without commitment as to encoding".
- Links as appropriate to corresponding resources belonging to Semantic Web projects participating in the common naming scheme, e.g. http://bio2rdf.org/pmid:15456405.
- Links as appropriate to related external resources that build on the database record. For example, the RDF for PubMed record 15456405 could link out to the iHOP page for the article described by the PubMed record.
- Other information related to the record that might be of use to a human reader or automated agent.
External links can be represented rigorously in RDF, allowing programs that access the RDF to proceed deterministically not only to the various encodings but also to any of the various related resources. This might be accomplished by having a distinct property for each participating project, or in some other way.
We do not generally expect the RDF to include any of the content of the database record, although there is no particular reason to rule this out (except in cases where license terms preclude it).
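To make the shape of such a document concrete, here is a minimal sketch that builds one as RDF/XML using Python's standard library. The vocabulary namespace, the class and property names, and the example.org domain are all hypothetical; the real choices are among the open questions for the steering committee. Only the type/database/key path layout follows the prototype.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/naming#"   # hypothetical vocabulary namespace
BASE = "http://example.org"         # hypothetical domain

ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)


def describe_record(database, key, encodings):
    """Build an RDF/XML description of one database record."""
    record_uri = f"{BASE}/record/{database}/{key}"
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": record_uri}
    )
    # URI documentation: type, source database, and key within that database.
    ET.SubElement(
        desc, f"{{{RDF}}}type", {f"{{{RDF}}}resource": EX + "DatabaseRecord"}
    )
    ET.SubElement(desc, f"{{{EX}}}database").text = database
    ET.SubElement(desc, f"{{{EX}}}key").text = key
    # One link per encoding, each encoding having its own URI.
    for enc in encodings:
        ET.SubElement(
            desc,
            f"{{{EX}}}encoding",
            {f"{{{RDF}}}resource": f"{BASE}/{enc}/{database}/{key}"},
        )
    return ET.tostring(root, encoding="unicode")


doc = describe_record("pubmed", "15456405", ["xml", "html"])
```

Links to participating projects and to related external resources would be added in the same way, for instance with one property per project.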
There will be a steering committee, made up of stakeholders, to make decisions about the system (domain name, URIs, content of served pages) and direct its management (who holds the domain name, who operates the server and updates scripts).
The served pages will be RDF/XML. This makes them harder for humans to read, but a GRDDL transform can be used to provide a human-readable (HTML) form without having to resort to content negotiation.
We believe it is important to follow practices around HTTP that are consistent with the HTTP specification and that are being promoted by W3C.
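One such practice is the 303 redirect mentioned earlier: a GET of a URI that denotes a record answers with "303 See Other" pointing at the RDF document about that record. The sketch below shows only the routing decision, not a running server. The /about/ path for the RDF documents and the example.org domain are assumptions made for the example.

```python
from http import HTTPStatus

BASE = "http://example.org"  # hypothetical domain


def respond(path):
    """Return (status, headers) for an HTTP GET of the given path."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "record":
        # The URI denotes a record, not a document: redirect to its description.
        _, database, key = parts
        location = f"{BASE}/about/{database}/{key}"
        return HTTPStatus.SEE_OTHER, {"Location": location}
    if len(parts) == 3 and parts[0] == "about":
        # The RDF document itself is served directly.
        return HTTPStatus.OK, {"Content-Type": "application/rdf+xml"}
    return HTTPStatus.NOT_FOUND, {}


status, headers = respond("/record/pubmed/15456405")
```

Keeping the record URI and the document URI distinct lets statements about the record (its database, its key) stay separate from statements about the RDF page that describes it.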
A server, or servers, will have to be put into operation. Ideally there would be redundancy, so that a failure causes only limited downtime. The server will for the most part just run some simple scripts, filling in RDF templates with particular identifiers, so infrastructure demands should be minimal.
We plan to make heavy scripted use of the server unnecessary by providing a number of technical alternatives. If one is examining 10,000 script-generated RDF documents, there is no reason to do 10,000 HTTP transactions to the server, as their contents are predictable. It must be easy to replicate the script on other sites, and to re-create large quantities of RDF triples as needed.
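Since the served content is just a template filled in with an identifier, a client holding a copy of the template can produce the same triples locally. The sketch below illustrates this with a one-line N-Triples template. The template, the property URI, the database name "exampledb", and the numeric keys are all placeholders and stand in for whatever the real script would emit.

```python
from string import Template

# Hypothetical template; the domain and property URI are placeholders.
TEMPLATE = Template(
    '<http://example.org/record/$db/$key> '
    '<http://example.org/naming#key> "$key" .'
)


def triples_for(database, keys):
    """Yield one N-Triples line per key, with no network access at all."""
    for key in keys:
        yield TEMPLATE.substitute(db=database, key=key)


# 10,000 descriptions generated locally instead of 10,000 HTTP transactions.
keys = (str(n) for n in range(1, 10001))
lines = list(triples_for("exampledb", keys))
```

The same function could run on the server to answer individual requests, so a local copy and the served documents stay in agreement as long as both use the same template.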
Science Commons has started a prototype of a naming system designed along these lines. The prototype employs purl.org and has very modest software and hardware demands.
Questions under consideration:
- what domain name to choose
- choosing or establishing a registry of short names for the databases in question (there are several of these lists kicking around, e.g. the one at lsrn.org)
- syntax: database/type instead of type/database (as in the prototype) (type = record, xml, html, ...) to better support delegation within purl.org's authorization framework
- writing key.type instead of type/key to support programs that determine file type from file extension (this is not currently supported by purl.org)
- consider whether the generic "record" type (without commitment as to encoding) should be explicit or implicit in the URI (e.g. http://foo.bar/database/key vs. http://foo.bar/database/record/key) - brevity vs. uniformity
- writing .../FOO_key instead of .../key to support the use of QNames in RDF/XML and Turtle (the local part of a QName, after the :, cannot begin with a digit) (this is not currently supported by purl.org)
- consider : instead of / as a separator
- pointing the domain name at purl.org, so that we can use the purl.org server for the time being without being tied to it indefinitely (this is the approach taken by purl.obofoundry.org)
- treatment of database and record evolution through time (versioning)
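The syntax alternatives in the list above can be compared side by side with a small sketch. It uses the placeholder domain http://foo.bar from the list; the database name, the key, and the "PMID" prefix standing in for FOO are illustrative assumptions, and the digit check is a deliberately simplified stand-in for the full QName rules.

```python
BASE = "http://foo.bar"  # placeholder domain, as in the list above


def type_first(db, key, kind="record"):
    """type/database/key, the layout used by the prototype."""
    return f"{BASE}/{kind}/{db}/{key}"


def database_first(db, key, kind="record"):
    """database/type/key, which groups everything for one database together."""
    return f"{BASE}/{db}/{kind}/{key}"


def implicit_record(db, key):
    """database/key, with the generic "record" type left implicit."""
    return f"{BASE}/{db}/{key}"


def key_with_extension(db, key, kind):
    """database/key.type, for programs that infer file type from extension."""
    return f"{BASE}/{db}/{key}.{kind}"


def qname_friendly(db, key, prefix):
    """database/FOO_key, so the last segment does not begin with a digit."""
    return f"{BASE}/{db}/{prefix}_{key}"


def colon_separated(db, key):
    """database:key, using : instead of / as the separator."""
    return f"{BASE}/{db}:{key}"


def starts_with_digit(uri):
    """Simplified check: does the segment after the last / begin with a digit?"""
    local = uri.rsplit("/", 1)[-1]
    return local[:1].isdigit()


plain = implicit_record("pubmed", "15456405")
safe = qname_friendly("pubmed", "15456405", "PMID")
```

Here `starts_with_digit(plain)` is true while `starts_with_digit(safe)` is false, which is the QName concern behind the FOO_key proposal.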
Related reading on the Neurocommons web site:
Thanks to Mark Wilkinson, Philip Lord, Sergei Egorov, and Peter Ansell for provocative discussions.