Monday, March 01, 2004

How to Roll Your Own Technorati

I pointed out a while ago on Brad DeLong's website that writing a program to emulate what Technorati's doing was neither original nor particularly hard, and that I could in fact do so in a single day. It seems some clowns out there found this hard to swallow, so I'm writing this to show just how easy it really is.

The main thing to realize is that Technorati is basically a combination of spider and database. The database aspect is straightforward enough that I shouldn't need to go further into it, but the spidering aspect is a lot easier than many people realize. All one has to know is that there's a Perl library out there called libwww-perl (or LWP for short), and that when writing spiders, LWP is your friend. In fact, LWP makes writing spiders so easy that one can even do so in a single line (well, almost ...). The only things one needs to be especially aware of are the formats used by the most popular blogging tools (Movable Type, Blogger, LiveJournal, etc.) for indicating links to individual posts, and of course one needs to keep a running tally of the number of incoming links to each node on the link graph.
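To make the spider-plus-tally idea concrete, here is a minimal sketch. It is written in Python with only the standard library rather than in Perl with LWP, and it is an illustration of the general approach, not Technorati's actual code. The names `LinkExtractor`, `extract_links`, `tally_inbound` and `fetch` are made up for this example, and counting per host (rather than per post permalink) is an assumption on my part.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all link targets found in one page, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def tally_inbound(pages):
    """Count inbound links per host.

    `pages` maps a page URL to its HTML. Each distinct target URL is counted
    once per source page, and links within the same host are ignored
    (both are assumptions of this sketch).
    """
    inbound = Counter()
    for url, html in pages.items():
        source_host = urlparse(url).netloc
        for target in set(extract_links(url, html)):
            target_host = urlparse(target).netloc
            if target_host and target_host != source_host:
                inbound[target_host] += 1
    return inbound


def fetch(url):
    """Download one page; this is the only part that touches the network."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

In use, one would call `fetch` on each blog in a seed list, collect the results into a dictionary, and hand that to `tally_inbound`. A real spider would also need to respect robots.txt and recognize each blogging tool's permalink format, both of which this sketch leaves out.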

There really isn't much else to Technorati other than what I've mentioned here. There's a trackback mechanism for telling the spider to update one's site right away, rather than waiting until some predetermined number of hours has passed, and there's a bit of JavaScript that can be used as a bookmarklet to send queries to the Technorati database. There's absolutely no sophisticated post-processing of the data gathered that could intimidate a would-be imitator - no graph-theoretical analysis of the link matrix, no NLP-based analysis of the posts, nothing out of the ordinary beyond a simple counting of how many incoming links a site has. This isn't to rubbish David Sifry's achievement, as I think he's providing a valuable service to the blogging world at no charge whatsoever. My only point is that this stuff just ain't that hard if you know what you're doing, and anyone who stares slackjawed at my saying so obviously doesn't have much programming experience.
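The update logic described above (re-crawl a site after a fixed interval, or sooner if the site asks) also fits in a few lines. Again this is a hedged Python sketch, not Technorati's implementation: the class name `CrawlSchedule`, its methods, and the six-hour default are all assumptions for illustration, since the actual interval isn't something I know.

```python
import time

# Assumed value; "some predetermined number of hours" is all that is known.
RECRAWL_INTERVAL_SECONDS = 6 * 60 * 60


class CrawlSchedule:
    """Track which sites are due for a crawl, by age or by explicit ping."""

    def __init__(self, interval=RECRAWL_INTERVAL_SECONDS):
        self.interval = interval
        self.last_crawled = {}  # site URL -> timestamp of the last crawl
        self.pinged = set()     # sites that asked for an early update

    def ping(self, site):
        """Record a request to update this site without waiting."""
        self.pinged.add(site)

    def mark_crawled(self, site, now=None):
        """Note that the site was just crawled and clear any pending ping."""
        self.last_crawled[site] = time.time() if now is None else now
        self.pinged.discard(site)

    def due(self, now=None):
        """Return the sorted list of sites that were pinged or have gone stale."""
        now = time.time() if now is None else now
        stale = {
            site
            for site, crawled_at in self.last_crawled.items()
            if now - crawled_at >= self.interval
        }
        return sorted(self.pinged | stale)
```

The spider's main loop would then just crawl whatever `due()` returns, call `mark_crawled` for each site, and sleep for a while before asking again.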

UPDATE: Anyone still in need of inspiration would be well-advised to get a copy of the Perl Cookbook. Look at Chapter 20 ("Web Automation") and proceed from there.