Archive for March, 2007

Data Mining Mailing List Archives with DEVONthink Pro

I’ve written a little in the past about DEVONthink Pro, a research tool for Mac OS X that I bought last summer. I’ve also written more recently about a little experiment that I’ve been meaning to perform in an attempt to make better use of the software: namely, to import the FOX mailing list archives into DEVONthink to see what kinds of data mining I could do on those archives.

My friend Charlies provided me with his archive of about 9600 e-mail messages from the mailing list (in mbox format, if you’re familiar with it) and so the first challenge was to decide how best to import them into DEVONthink. I ended up writing a little Ruby script to chunk the file into separate HTML files (one per message) and then imported them into a new DEVONthink database. So far, so good.
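The script itself isn't reproduced here, but a minimal sketch of the idea might look like the following. It assumes a conventional mbox file in which every message starts with a "From " separator line and any body line beginning with "From " has been escaped as ">From ". The function names (`split_mbox`, `message_to_html`, `export_mbox`) and the output file naming are made up for illustration.

```ruby
require "erb"
require "fileutils"

# Splits mbox text into raw messages. Assumption: each message begins with a
# separator line starting with "From ", and body lines that start with
# "From " were escaped as ">From " when the archive was written.
def split_mbox(text)
  messages = []
  current = nil
  text.each_line do |line|
    if line.start_with?("From ")
      messages << current if current
      current = +""          # start a new, mutable message buffer
    elsif current
      current << line
    end
  end
  messages << current if current
  messages
end

# Wraps one raw message in a bare-bones HTML page, using the Subject header
# (looked up only in the header block) as the page title.
def message_to_html(raw)
  headers = raw.split(/\r?\n\r?\n/, 2).first.to_s
  subject = headers[/^Subject:[ \t]*(.*)$/i, 1].to_s.strip
  subject = "(no subject)" if subject.empty?
  "<html><head><title>#{ERB::Util.html_escape(subject)}</title></head>" \
    "<body><pre>#{ERB::Util.html_escape(raw)}</pre></body></html>\n"
end

# Writes one HTML file per message into out_dir and returns the message count.
def export_mbox(mbox_path, out_dir)
  FileUtils.mkdir_p(out_dir)
  # Read as bytes and replace invalid sequences, since old mail is rarely clean UTF-8.
  text = File.binread(mbox_path).force_encoding("UTF-8").scrub("?")
  messages = split_mbox(text)
  messages.each_with_index do |raw, i|
    name = format("message-%05d.html", i + 1)
    File.write(File.join(out_dir, name), message_to_html(raw))
  end
  messages.size
end
```

Calling something like `export_mbox("fox-archive.mbox", "messages")` (both names hypothetical) would leave a folder of HTML files that can be imported into a DEVONthink database in one go.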

Next, I wondered how well DEVONthink’s “Auto Group” feature would stand up to the challenge of grouping this volume of data. I tried a couple of small samples (say, selecting a hundred or so messages and auto-grouping them) and it finished the job in a few seconds, and it did an admirable job of grouping them, I might add. But I did wonder what the “big-O” complexity of this operation was, and whether it would cause my MacBook to burst into flames if I asked it to auto-group all 9600 items. I charged ahead and tried anyway.

I selected all of the message items and asked DEVONthink to auto-group them. The first pass, in which it “compared text” for the items, took approximately 20 minutes. The second pass, in which it actually grouped the items, took another 35 minutes. So, close to an hour’s worth of processing on my 2.0 GHz MacBook with 1 GB of RAM, but it did finish. Note that these are all relatively small files; I’m assuming the problem becomes progressively more difficult with larger documents.
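As a rough sanity check on the big-O question, the small-sample timing can be projected to the full archive under a few common growth models and compared with the 55 minutes actually observed. This is only back-of-the-envelope arithmetic: "a few seconds" is assumed to mean 3 seconds for 100 messages, and nothing here reflects how DEVONthink actually implements auto-grouping. The `projected_seconds` helper is invented for the example.

```ruby
# Projects a small-sample runtime to a larger input under three growth models.
def projected_seconds(base_seconds, base_n, target_n)
  ratio = target_n.to_f / base_n
  {
    "O(n)"       => base_seconds * ratio,
    "O(n log n)" => base_seconds * ratio * Math.log(target_n) / Math.log(base_n),
    "O(n^2)"     => base_seconds * ratio**2
  }
end

# Assumption: 100 messages took about 3 seconds ("a few seconds").
estimates = projected_seconds(3.0, 100, 9600)
observed  = (20 + 35) * 60 # the two passes reported above, in seconds

estimates.each do |model, seconds|
  puts format("%-10s %8.1f min", model, seconds / 60.0)
end
puts format("%-10s %8.1f min", "observed", observed / 60.0)
```

Under that assumption the observed time lands well above the linear projection and well below the quadratic one, which is about as much as two data points can say.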

I was a little surprised, and disappointed, by the results. It did create 303 new groups, each of which contained a handful of messages — presumably with each group corresponding to all of the messages in a given thread — but it left the remaining 9000 or so messages ungrouped. I think I’ll try again to auto-group those remaining messages, but I assume that it will take another hour to do so and I’m not sure what to expect at the end of the run; will it attempt to construct groups for those items this time?

Track Blog Comments with co.mments

You know, I’ve seen references to co.mments for a while now, but for some reason my brain filtered them into the “look at later” pile.

co.mments is a service that helps you track the conversations that take place in the comments on a blog post. If you are in the habit of reading blogs, I think you’ll agree that some of the most interesting information comes out of the responses from readers, and the back-and-forth between the author and readers. Some blogs of course provide RSS feeds for a post’s comments (at least, I’m pretty sure I’ve seen that before), but it’s not something I see universally. So co.mments provides you with a central place to track all of those conversations, and includes such niceties as a bookmarklet for tracking the post you’re currently viewing, an RSS feed for all of your tracked conversations, and e-mail notifications of changes to those conversations.

Ridiculous Business Jargon Dictionary

The Ridiculous Business Jargon Dictionary is a sort of Urban Dictionary for the workplace. I’ve heard a lot of these before, but some are new to me, e.g.:

Malicious Obedience [n.] The act of following a boss’s instructions explicitly, while hoping for failure. It can also involve remaining quiet about any poor judgement or discovered mistakes.