Friday, June 18, 2004

Google and Metadata

Normally I'd put this in my other blog, but this is of sufficiently general interest that I'll talk about it over here. In reference to a statement by Jon Udell, Brad DeLong says the following:

It's worth writing down a few speculations about just why it is that people like me and John Udell post things on our weblogs as a way of finding them later:
  1. We get to use all of Google's computers to search that subpart of their full-text database of the web that is in our websites--and that's a big plus.
  2. We get the free addition of keywords to files on our websites. Google uses the text people use to link to a file as an indicator of its contents. And the words people use in a link are good keywords.
Any system for "finding stuff" that is going to beat post-and-Google has to figure out a way of doing at least as well at these two tasks--quick full-text search on the one hand, and intelligent (and cost-free to us) creation of keywords to files on the other. This is going to be hard to do.

Now, there are two points worth noting here. The first is that the automatic keyword creation Brad DeLong desires is not only easy to accommodate, but the infrastructure is already mostly in place, at least for users of Windows 2000 and Windows XP, both of which come with a built-in Indexing Service. Documentation on it is patchy, and the service is quite resource-intensive, but this is due more to a lack of interest on Microsoft's part than to anything else: keyword indexing is far from being on the cutting edge of information retrieval.

The second thing I'd like to discuss is why exactly Brad finds it so much easier to discover things he's interested in by using Google to search his own website than by trawling through his own machine. It's a point that has a great deal of bearing on the likelihood of success of ambitious proposals like Microsoft's WinFS. As Brad himself notes, the real reason why Google is able to do such a good job with his data is rather mundane: by leaving his notes on the web for all to see, Brad DeLong makes it possible for others to link to and comment on his posts, and in so doing they create for him the metadata required for him (and anyone else) to do efficient searches of his own website. Metadata creation can be tedious in the extreme if one has to do it all oneself. But given the combination of a web of millions, each of whom need only do a tiny bit of the job, and a high-traffic website that gets thousands of reads a day, all the ingredients are in place for Google to return results of far better quality than any desktop tool can currently offer.
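
To make the point about borrowed metadata concrete, here is a toy sketch in Python of an index that credits the words in a link to the page being linked to, alongside the words in the page itself. It is only an illustration of the general idea, not a description of how Google actually works; the function names, URLs and sample text are all made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages, links):
    """Map each word to the set of page URLs it describes.

    pages: dict of url -> body text (hypothetical sample data)
    links: list of (source_url, target_url, anchor_text) tuples
    """
    index = defaultdict(set)
    # Ordinary full-text indexing: words found in the page itself.
    for url, body in pages.items():
        for word in tokenize(body):
            index[word].add(url)
    # Anchor text is credited to the page being linked TO, so other
    # people's words become keywords for a page that never used them.
    for _source, target, anchor in links:
        for word in tokenize(anchor):
            index[word].add(target)
    return index


def search(index, query):
    """Return the pages associated with every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up pages and links, purely for illustration.
pages = {
    "example.org/notes/1": "Some thoughts on interest rates and the bond market.",
    "example.org/notes/2": "Lecture outline for next week.",
}
links = [
    ("example.net/blog", "example.org/notes/1", "good primer on monetary policy"),
]

body_only = build_index(pages, [])
with_anchors = build_index(pages, links)

print(search(body_only, "monetary policy"))     # no match: the note never uses these words
print(search(with_anchors, "monetary policy"))  # matches the note via someone else's link text
```

The first index is roughly what a desktop keyword indexer has to work with; the second only becomes possible once other people can see a page and link to it.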

What does all this mean for WinFS? Well, for one thing, given what we know about the (lack of) advances in AI over the last few decades, it means that the only way WinFS could possibly come close to living up to Microsoft's promises would be to expose the private datastores of Windows users to the rest of the world. That prospect is as uninviting as it is unlikely, especially coming from a company with as poor a record on security as Microsoft's. Google needn't worry too much about competition from Redmond just yet.