Meetings/2009-04-29/Minutes

Attendees

See Also: Benjamin Dai's meeting notes

Review of shared names problem, goals, and scope

Jonathan Rees gave a short (30 min) presentation on his thinking about Shared Names to date:

  • HCLS focus for identifiers (biomedical, specifically)
  • They want to use Semantic Web standards for data integration and computational reasoning.
  • If data sources share URIs, they connect.
  • Distinct names for the same thing can be connected through concordances, but this is fragile.

"we"/"they" == Shared Names steering committee

In Scope

Records in public life sciences databases/repositories. All others are out of scope. "Start small and expand later". Further, each entry already has a name - don't rename them. Instead, map the databases to a common namespace.

Out of Scope

  • biological entities
  • chemical entities
  • documents referenced from identifiers

Bioinformatics practitioners end up referring to a small number of well-known databases (e.g. UniProt, NCBI, PubChem). There is no common identification scheme. Instead, each implementation creates its own identifier for each database (PURLs, LSID, LSRN, DOI...).

See Benjamin Good and Marc Wilkinson, 2006 - Biodiversity informatics. Dan Connolly, 2006, said to Jonathan, "if a community needs identifiers, it should organize itself".

OBOFoundry is going to ontologically describe entities using the same approach as above - Alan: "Ontologies won't be a poor cousin".

The project so far:

  • high availability lookup via HTTP
  • ontological clarity regarding what is named.
  • responsive to community input.
  • convincing persistence story (longevity)

Jonathan: "The point is not to make names. We've seen that over and over again. The point is to make names that will be used by multiple projects."

Can we agree on a set of existing URIs? Check each URI against existing requirements.

Requirements

  • names
  • lookup services
  • content (state)
  • administrative interface
  • infrastructure (technical replicates, local mirrors, automatic failover at DNS level, master/slave with reassignable master envisioned - like Linux).
  • responsible social entity (members can come and go, DNS held by group, responsible for what the domain responds).

The prototype that Jonathan and Alan developed uses several pieces, based on PURLZ. They use a template (per database) to create an "about document" (metadata), pointed to by 303 redirects.
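The template-plus-303 arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: the template strings, the example.org URLs, and the `resolve` function are all assumptions, since the minutes do not record the actual templates.

```python
from string import Template
from typing import Dict, Tuple

# Hypothetical per-database templates; the prototype's real templates are
# not recorded in these minutes.
TEMPLATES: Dict[str, Template] = {
    "pmid": Template("http://example.org/about/pubmed/$id"),
    "uniprot": Template("http://example.org/about/uniprot/$id"),
}


def resolve(database: str, record_id: str) -> Tuple[int, str]:
    """Return (HTTP status, Location) for a shared name.

    A known database yields a 303 pointing at the templated "about
    document"; an unknown database yields 404 with an empty location.
    """
    template = TEMPLATES.get(database)
    if template is None:
        return 404, ""
    return 303, template.substitute(id=record_id)
```

Keeping one template per database also localizes the maintenance concern Dave Hau raises below: if a database changes structure, only its template needs editing.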

NB: Alan and Jonathan: Have no plans for community curation; they prefer curation by the steering committee to provide strong review and trust. Dave Hau: May need to modify templates if underlying databases change structure.


Demonstration

Alan gave a demo of a prototype (PURLs, wiki)

  • 5 servers - 1 master (a.demo.sharedname.org), 4 slaves (b-e). The last one is at purl.org (OCLC) and requires a bit of hacking to set up as a slave since they are limited to the public interface.
  • DNS-based load balancing (multiple A records for each host)
  • Different versions accessible via URL encoding (e.g. http://demo.sharedname.org/n/html/pmid/16124938 returns HTML, and http://demo.sharedname.org/n/medline/pmid/16124938 returns the medline version)
  • Records are RDF, with XSLT stylesheets to allow HTML display in browsers. Includes DOAP information back to the project.
  • Showed monitoring of servers using Nagios - may monitor down to the record level, perhaps using PURL validation.
  • Showed ontology editing using MediaWiki.
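The two demo URLs suggest a path layout of /n/{format}/{database}/{id}. The sketch below splits a URL on that assumption; the layout is inferred from those two examples only, and the `SharedName` type and `parse_shared_name` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class SharedName(NamedTuple):
    fmt: str        # requested representation, e.g. "html" or "medline"
    database: str   # database key, e.g. "pmid"
    record_id: str  # record identifier within that database


def parse_shared_name(url: str) -> SharedName:
    """Split a demo-style URL into format, database, and record id.

    Assumes the path looks like /n/{format}/{database}/{id}; anything
    else raises ValueError.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "n":
        raise ValueError(f"not a recognized shared-name URL: {url}")
    _, fmt, database, record_id = parts
    return SharedName(fmt, database, record_id)
```

For the first demo URL this yields the format "html", the database "pmid", and the record id "16124938".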

LSIDs included a concept of third party metadata services - point to external metadata about a resource. The problem with that approach (Alan) is possible ontological inconsistency.

Harold Solbrig: If there could be a way to map existing identifiers in existing databases to new shared names, that would facilitate adoption. Mapping service?

Doug Burke: You can't stop someone making RDF statements about your work. May want provenance information added to metadata. Alan: agree.

Michael Halle: Reverse name service would allow leverage of existing services.

Harold has expressed interest in working on a reverse name service. Trish and Benjamin suggested that NCBO could help. Alan suggested that this was valuable, but outside the scope for the Shared Name steering committee.


Perspectives Talks

Clinical research data interoperability (Bosse Andersson)

See presentation

  • Clinical perspective from AstraZeneca. They want common identifiers to assist collaboration between scientists seeking new drugs.


LSRN (Sergei Egorov)

See LSRN.org

  • RDF based and driven. URN mapping.
  • Users in pharmas moved to local copies running inside their firewalls to avoid exposing business intelligence. The public site is maintained, but no longer actively used.


caBIG (Dave Hau)

See caBIG Compatibility Guidelines and Concept ID's

  • caBIG: Published a compatibility guide for achieving syntactic and semantic interoperability.
  • Includes a "caBIG Concept Identifier" on caGrid: Point to both concepts and resources.
  • Moving toward a federated architecture.
  • Based on collaboration with UK cancer team, academia.
  • Focused on long-lived data sources, such as NCI Thesaurus, ISO/IEC 11179 metadata registries. Patient records are out of scope.
  • They serve an *exact replica of a record*
  • "Consistent representation is required, but not a consistent resolution strategy"
  • Representation scheme is OID (from HL7; not much tool support). They considered using handles, but have not decided.


DSPACE and Simile (MacKenzie Smith)

  • Long term governance scheme is critical (experience from DSPACE)
  • Get large adoption, and make it long lived (ed: learn lessons from LSRN)
  • Whoever governs this shouldn't be individuals, nor a project-based org. Ideally, it should be an international org (e.g. not the Library of Congress).
  • The business model of CNRI met the above requirements, and thus handles were chosen for DSPACE. PURLs, on the other hand, were not seen as backed by OCLC: MacKenzie could not find evidence of OCLC's institutional commitment.

NB: Shared Names has bylaws as a non-profit association, and reserves the right to become more formal. Alan: The model they are trying to follow is Linux. Michael Halle: But the definition of "long-lived" may be different to Linux.

MacKenzie made the point that librarians have long understood the relationship between a card in a card catalog and the book to which it refers. A discussion of Crossref/DOI registry pricing ensued.

Scott suggested that Shared Names is aiming at a federation from the beginning for governance. Alan disagreed, saying that Shared Names as an entity would mostly manage the DNS. MacKenzie: The operation of a legal entity is non-trivial; the editing role is also critical and takes work. Alan: Operate as a board, with a succession plan. MacKenzie: Spend as much time as possible trying to find an existing entity to fill the governance role. Harold: You are presuming that you need no money for operating. Alan: We hope to get funding once it is running. MacKenzie, Harold: Seems dangerous.


Bio2RDF (Marc-Alexandre Nolin)

  • Create RDF for existing databases.
  • Followed Tim Berners-Lee's (TBL's) four rules:
    • use URIs to name things
    • use HTTP URIs
    • Provide useful info when someone resolves a URI
    • Include links to other URIs.
  • Magic predicates are used.
  • Blank nodes are forbidden (within a database record).
  • At its simplest, Bio2RDF is not software, it is a set of rules to follow.
  • Working on reverse links, using a Tomcat-based FLOSS project.
  • Using Virtuoso, Tomcat, Freebase, Apache, Taverna.


PURLZ (David Wood)

See presentation

  • Introduced PURLs, with a focus on requirements and Web-centricity.
  • Noted that the new PURL system is defined by types and RESTful API. Like Bio2RDF, it has become a spec and not a system - all implementations should be cross-functional.


Handle system (Larry Lannom)

  • "DNS for objects": you send it a handle and you get back a list of objects.
  • "It is for exposing data that you manage"
  • Binary protocol over TCP
  • Reiterated need for organizational commitment. External funding dries up.
  • CNRI is close to breaking even; ~$300K per year cost to host. Parts of 4 people plus hosting costs.
  • Nominal charges for services ($50/year) not software. "Works"
  • Keep it simple - ownership initially had 9 levels of control and was impossible to manage.
  • "Dependency breeds collaboration" - become necessary and you won't be allowed to fail.
  • "Do not derive persistent names from changeable attributes of the named entities" (e.g. hostnames, DNS) "Otherwise, you don't control your namespace."
  • People care a lot less about numbers than semantically meaningful names. You get the next HS prefix by adding 1 to the last prefix. Simple and adaptable.
  • Levels of indirection enable changes over time. Entire blocks of handles have changed ownership over time (e.g. M&A).
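The two naming points above (sequential, meaning-free prefixes, and indirection so that a name can outlive its location) can be illustrated with a toy table. This is not the Handle System protocol or API; the `HandleTable` class, its methods, the starting prefix, and the example.org URLs are all hypothetical.

```python
from typing import Dict


class HandleTable:
    """Toy indirection table; not the Handle System protocol or API."""

    def __init__(self, first_prefix: int = 1000) -> None:
        self._next_prefix = first_prefix
        self._locations: Dict[str, str] = {}

    def allocate_prefix(self) -> str:
        """Hand out the next prefix by adding 1 to the last one issued."""
        prefix = str(self._next_prefix)
        self._next_prefix += 1
        return prefix

    def bind(self, handle: str, location: str) -> None:
        """Point a handle at a location, replacing any earlier binding."""
        self._locations[handle] = location

    def resolve(self, handle: str) -> str:
        """Return the current location bound to a handle."""
        return self._locations[handle]
```

Because the handle carries no hostname or owner name, rebinding it after a change of ownership leaves every published reference intact.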


Issues List

See Issues List

  • /Provider partitions
 /Provider partitions/Biased unqualified
 /Provider partitions/Normalize taxon
 Providers with bad/ugly URLs may be hard to map to clean URLs.
  • Dissemination
 No resolution.
  • 404
 No resolution.
  • /One ID per record
 We will have a problem if multiple IDs are created that refer to one record.  How should the curator clean up the IDs?  Remove dupes?  Retain dupes?  Point dupes to canonical ID?
 Need some tool support for curators.  If a curator tries to create an identifier that points to an existing target, a flag could be raised.
 Others may provide disambiguation services.
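The curator flag suggested under "One ID per record" could work roughly like this. It is a minimal sketch under assumed names: `NameRegistry` and `register` are hypothetical, and the policy shown (refuse the duplicate and report the existing name) is only one of the options the discussion left open.

```python
from typing import Dict, Optional


class NameRegistry:
    """Tracks which shared name already points at each target record."""

    def __init__(self) -> None:
        self._target_to_name: Dict[str, str] = {}

    def register(self, name: str, target: str) -> Optional[str]:
        """Register name -> target.

        Returns None on success. If a different name already points at
        the same target, nothing is stored and that existing name is
        returned so the curator can be warned.
        """
        existing = self._target_to_name.get(target)
        if existing is not None and existing != name:
            return existing
        self._target_to_name[target] = name
        return None
```

A curation tool could treat a non-None return value as the raised flag and offer to point the duplicate at the canonical ID instead.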


  • /OCLC and purl.org
 OCLC supports libraries, not HCLS companies/orgs.  They need to be approached to determine whether they will support Shared Names within or alongside purl.org.


  • /Server online failure
 If you use a server that goes offline, you will fail to resolve.  Is it sufficient to monitor the servers and update DNS?
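The "monitor the servers and update DNS" idea amounts to recomputing the published address set from health checks. The sketch below shows only that selection step; the function names, the injected `is_up` check, and the `minimum` threshold are assumptions, and actually pushing the result to a DNS provider is out of scope here.

```python
from typing import Callable, Iterable, List


def live_servers(servers: Iterable[str], is_up: Callable[[str], bool]) -> List[str]:
    """Return the servers that pass the health check, in their original order."""
    return [server for server in servers if is_up(server)]


def dns_record_set(
    servers: Iterable[str], is_up: Callable[[str], bool], minimum: int = 1
) -> List[str]:
    """Compute the set of servers that should stay in DNS.

    Raises RuntimeError rather than returning fewer than `minimum`
    entries, so that a monitoring glitch cannot empty the zone.
    """
    live = live_servers(servers, is_up)
    if len(live) < minimum:
        raise RuntimeError("too few healthy servers; keep the current records")
    return live
```

Even with this in place, clients holding a cached record for a failed server will keep failing until the record's TTL expires, which is one reason monitoring alone may not be sufficient.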


  • /What metadata to record
 Alan collected the union of metadata supported by various systems.  Some discussion, no resolution.


  • /Form of URIs
 Should URLs encode optional format information for the returned resource?  Perhaps, but in accordance with W3C TAG guidance and best practices.
 Versions of a resource are treated as having separate identifiers.
 There is a unique name assumption within the Shared Name system.
 Versions:  The "truth" resides in the RDF.  Analysis of the RDF can sometimes allow disambiguation without dereferencing URIs.

Michael Halle's use case: Snapshots of "latest" URIs? Alan suggests that the consumer would have to snapshot themselves. There may be better ways, e.g. multi-level redirection...


  • /What distributions
 No resolution.


Possible Best Practices

  • If you refer to an external resource, you add it to a Shared Name's seeAlso entry.
  • If you refer to a name for an out-of-scope resource (e.g. a biological entity), then you use an existing name if it exists. If one does not exist, then what?

NB: "resources" in the Shared Names sense refer to "records" - descriptions of the end resources.

Action Items

  • Explicitly define what HTTP content negotiation would result in.
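As a starting point for this action item, one possible negotiation rule is sketched below: pick the highest-q media type in the Accept header that the service can serve. The `FORMATS` table, the "rdf" format key, the RDF default, and both function names are assumptions for illustration; the group did not decide on any of them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from media type to a format key.
FORMATS: Dict[str, str] = {
    "application/rdf+xml": "rdf",
    "text/html": "html",
}


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, best first."""
    entries: List[Tuple[str, float]] = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        entries.append((media_type, q))
    # Stable sort: ties keep the order in which the client listed them.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header: str, default: str = "rdf") -> Optional[str]:
    """Choose a format key for an Accept header, or None if nothing fits."""
    if not header.strip():
        return default  # no preference stated
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in FORMATS:
            return FORMATS[media_type]
        if media_type == "*/*":
            return default
    return None
```

Under this rule a typical browser header that lists text/html first would get HTML, while a client sending no Accept header would get the assumed RDF default. A None result would correspond to a 406 response.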

Parking Lot

  • Jeffrey Grethe's use case
  • Encoding of result format in URIs not optimal - move the format to a suffix of some form (but requires a change in the PURL software to allow partials that have URLs longer than existing partials - Resolution: fix in PURLs).
  • Performance issues:
    • Hitting databases for information that will never be read?
  • Michel's reference to a registry: include Bio2RDF and LSRN in the list of (records, Alan: encodings). Scott's interpretation: supporting migration and compatibility.
  • Alan: don't want the dragons of attempting ontology. There are at least two ongoing efforts (NCI, OBO). Harold mentions possibility of merging them.
  • Olivier: We should try to bridge the ontology approach with SN. Alan: ontologies typically refer to dbx records.
  • Sergei: A pattern to connect SN with data (SKOS, etc.) USE CASE! Please place on wiki.