Sunday, March 21, 2004

My Kingdom for an IP Lawyer!

I just happened to be reading the above NYT article on namespace and copyright clashes in the Internet age, when a particular statement in it (highlighted below) caught my attention, for a reason I shall explain shortly.

Certain namespaces have grown dangerously overcrowded. Pharmaceutical names are a special case: a subindustry has emerged to coin them, research them and vet them. The Food and Drug Administration now reviews proposed drug names for possible collisions, and this process is complex and subjective. Rigor may be impossible, and mistakes cause death. METHADONE (for opiate dependence) has been administered in place of METHYLPHENIDATE (for attention-deficit disorder), and TAXOL (a cancer drug) for TAXOTERE (another cancer drug). Doctors fear both look-alike errors and sound-alike errors: ZANTAC/XANAX; VERELAN/VIRILON. Linguists are devising scientific measures of the ''distance'' between drug names. But LAMICTAL and LAMISIL and LUDIOMIL and LOMOTIL are all approved drug names. Meanwhile, of course, drug companies have other worries; they spend millions on market research to make sure their names are both serious and sexy. ROGAINE, the hair-growth treatment, was deliberately chosen to make you think ''regain.''

Now, this article is interesting in its own right, but as I've said, that isn't why I'm mentioning it. The real reason for focusing on it here is that the particular sentence highlighted happens to deal with an issue I myself dealt with a few years back while doing research on information retrieval - how does one determine a method for measuring similarity between words, in the sense of imposing a metric on them, where "metric" is meant in the strict mathematical sense? By this I mean that given any words ABC, DEF and GHI, and using D(X,Y) to denote the distance between X and Y,

  1. D(ABC,DEF) ≥ 0
  2. D(ABC,ABC) = 0
  3. D(ABC,DEF)=0 ⇒ ABC=DEF
  4. D(ABC,DEF) = D(DEF,ABC)
  5. D(ABC,GHI) ≤ D(ABC,DEF) + D(DEF,GHI)

The preceding properties are very nice to have for mathematical reasons, as they are in fact an abstraction of the properties enjoyed by our conventional notions of distance, and therefore well-studied. As it turns out, I did actually find a satisfactory (and non-obvious) solution, and, perhaps most importantly, one that wasn't too computationally costly to be worthwhile. It was clear to me at the time that this was an idea worth getting intellectual protection for, but I lacked the resources to do so at the time. Maybe I ought to write James Gleick to find out exactly was looking for the the distance measures he wrote about ...