Issue/Case and canonicalization
email: Peter Ansell's response raises new issue
Peter says
On the agenda should also be the issue of case-sensitivity. I have had a large amount of trouble with the current assumption in Bio2RDF that lowercasing everything is likely to be more consistent than keeping the case of identifiers that are given by the provider. Obviously, I would prefer that people just didn't modify the case that they find on the official provider database interface or database dumps, but people do inevitably modify them in some circumstances it seems. If you are going to reference dbpedia/wikipedia for example, you can't go around trying to normalise the case to an arbitrary standard like lowercasing, as the identifiers are very case-sensitive with different articles being referenced in some cases if you change the case.
Hopefully not giving too many ideas, but you might also want to discuss what best practice preference you want to give to full percent encoding of identifiers as opposed to either percent and plus encoding for spaces, or no encoding at all. I would prefer a full percent encoding within identifiers for any potentially reserved character, and UTF-8 encoding prior to percent encoding for non-ASCII characters.
(I don't like the typical urlencoding scheme with encoding spaces as "+" because it creates ambiguities if people really have + symbols in their identifiers; %20 is more consistent for space encoding.) As with you, I don't like the idea of redundant identifiers, except for cases like HGNC/HUGO where both distinct namespaces are primary keys on the database, and both are useful for people trying to reference the record. I definitely don't like the idea of redundant identifiers within namespaces, i.e., having hugo:Example+Symbol, hugo:Example%20Symbol and hugo:example%20symbol side by side would be a less than useful best practice in my opinion, from my experience working with Bio2RDF.
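As a rough illustration of the encoding preference Peter describes (UTF-8 first, then percent encoding of everything outside the unreserved set, with spaces as %20), here is a minimal Python sketch. The function name `encode_identifier` and the sample identifiers are assumptions made for this example only; nothing here is an agreed policy.

```python
from urllib.parse import quote, quote_plus, unquote_plus


def encode_identifier(identifier: str) -> str:
    """Hypothetical helper: UTF-8 encode, then percent-encode every
    character that is not an RFC 3986 unreserved character."""
    # safe="" means no reserved character (not even "/") is left unencoded.
    return quote(identifier, safe="", encoding="utf-8")


# Full percent encoding: a space becomes %20 and a literal plus becomes %2B,
# so the two can never be confused after encoding.
print(encode_identifier("Example Symbol"))  # Example%20Symbol
print(encode_identifier("A+B"))             # A%2BB
print(encode_identifier("café"))            # caf%C3%A9

# Form-style "plus" encoding, for comparison: a space becomes "+".
print(quote_plus("Example Symbol"))         # Example+Symbol

# The ambiguity: a raw "+" in an identifier is read back as a space
# by a form-style decoder.
print(unquote_plus("A+B"))                  # A B
```

The last line shows why a literal + in an identifier is unsafe under the plus convention: a decoder cannot tell whether it was originally a plus or a space.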
I think the issue is what the URLs denote. For shared names we've said the URLs denote *records*, rather than *identifiers*.
What you suggest seems more along the lines that we should be naming primary keys, i.e. identifiers.
If we have two URLs for the same *record*, then we have the problem that half the people can use one and half the people can use the other, and then we are in trouble.
I don't have a principled reason to choose between the names and the numbers, other than the fact that the names are unlikely to be as stable and therefore seem less suitable for a project such as ours.
On the issue of case sensitivity, I think we need to think of our use case: RDF. RDF identifiers are case sensitive (and indeed canonicalization sensitive). So I think we can't tolerate different case spellings of our identifiers and will need to figure out a consistent policy.
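To make the point concrete, the following Python sketch treats identifiers the way RDF does, as plain strings compared character by character. It reuses the three spellings from Peter's email; the base `http://example.org/` is a placeholder assumed for illustration, not a proposed namespace.

```python
# Placeholder base URI, assumed only for this example.
BASE = "http://example.org/"

# The three spellings from the email, which a human reads as "the same" record.
spellings = [
    BASE + "hugo:Example+Symbol",
    BASE + "hugo:Example%20Symbol",
    BASE + "hugo:example%20symbol",
]

# RDF compares identifiers as strings, with no case folding and no
# percent-decoding, so each spelling names a different node in the graph.
distinct_nodes = set(spellings)
print(len(distinct_nodes))  # 3

# Statements made about one spelling are not visible from another.
statements = {spellings[0]: ["has label 'Example Symbol'"]}
print(statements.get(spellings[1], []))  # []
```

Without a single agreed spelling, data published under one form would simply not join with data published under another.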
There's no reason we can't include such auxiliary information as alternative case spelling or other primary keys as metadata.
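One way to picture this is a record with exactly one URL, carrying the other spellings and keys as plain metadata that can still be used for lookup. The sketch below is only an assumption about how that might look: the `Record` class, its field names, the `build_lookup` helper and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Record:
    """Hypothetical record: one canonical URL plus auxiliary metadata."""
    uri: str
    alternative_spellings: Set[str] = field(default_factory=set)
    other_keys: Dict[str, str] = field(default_factory=dict)


def build_lookup(records: Iterable[Record]) -> Dict[str, str]:
    """Map every known spelling or key value to the single canonical URL."""
    lookup: Dict[str, str] = {}
    for record in records:
        lookup[record.uri] = record.uri
        for spelling in record.alternative_spellings:
            lookup[spelling] = record.uri
        for value in record.other_keys.values():
            lookup[value] = record.uri
    return lookup


# Sample data invented for the example.
record = Record(
    uri="http://example.org/record/1234",
    alternative_spellings={"example symbol", "EXAMPLE SYMBOL"},
    other_keys={"symbol": "Example Symbol"},
)
lookup = build_lookup([record])

# Any spelling resolves to the same URL, but only that URL is used in RDF.
print(lookup["example symbol"])  # http://example.org/record/1234
print(lookup["Example Symbol"])  # http://example.org/record/1234
```

The alternative spellings remain discoverable, yet they never act as identifiers themselves, so the record still has a single name.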
