Bayesian Nets, Latent Semantics, Despamming and other speculations David Weinberger writes about the application of Bayesian networks to client-side spam detection software,and gets taken to task for putting OS X Mail on the list of products, when it actually uses latent semantic analysis. Kevin Marks knows that I played in this domain once upon a long time ago, and asks for a comparison and comment. Okay: Bayesian networks, also known as belief networks, are indeed used in spam detection, including in server-side products such as Brightmail. (Disclosure: I have a dog in this fight.) A Bayes net is a graph structure whose nodes are terms or concepts with attached truth values or probabilities, and whose links are conditional probabilities. A conditional probability is a statement such as:
Latent semantic analysis (or indexing) is an application of what's called principal components analysis (PCA), or factors analysis, to the domain of information organization. In the basic version, you form a big 2-D matrix with documents (e-mails for instance) along one axis and terms (word, phrases) along the other, and fill in the entries with a 0 when the term doesn't occur in the document, and with a 1 (or count) when it does. Then you take the resulting monstrous matrix and grind it up with an algorithm that finds covariance patterns. That's to say, the associations of words "latent' in the document base you feed in are going to be found. Shovel in several weeks worth of news stories and it's going to be obvious that 'Saddam' and 'Iraq' are highly correlated, or 'Tiger' and 'golf'. The method actually kicks out a transformation matrix into which you can feed the terms observed in a particular document, and get out a score for that document in terms of "warness" or "golfness" - those are principal components, or factors. You compute and save as many factors as you want - presumably less than the number of original terms. (Apologies to any wandering mathematicians for the gross simplications.) Latent Semantics was invented at Bellcore in late 1980s, by an All-Star information science team including Tom Landauer, George Furnas, and Sue Dumais. (IP alert: It's patented.) There's a more recent description from Tom Landauer's university group here (PDF) and another overview under Creative Commons license here. The algorithm I hand-waved up above is called 'singular value decomposition.' Back in the early 90's I and a team did a project called Guides (PDF) using the latent semantics technique. Since it was a small problem and we had access to free cycles from a Cray, we bludgeoned it to death with an off the shelf math library. For real world use, though, you need a variant of SVD that takes advantage of the sparseness of the document matrix. One of the world's experts on sparse SVD is Dr. Michael Berry at UT Knoxville. Back when we were sending him some funding to port his algorithms so we could use them at Apple. And it looks like he's still banging away on reducing the computational costs ten years later. The good things about latent semantics with the SVD include:
So what's Apple doing with latent semantics to catch spam? Not sure. The simplest approach is to use a related factor analysis technique to find the best fit to predicting spam/not-spam in a training sample; it's not a full PCA but I suppose you could call it latent semantics. It would be more interesting if they are using the full deal, maybe computing separate models for spam/not. Because, you see, latent semantics naturally lends itself to automatic sorting and organization of the document space over which the model was computed. And a few years back, the Apple group that I and then Dan Rose managed did an automatic e-mail organization project rather unfelicitously called piles, that included a user interface for just such a thing. (IP alert: Some of it was patented.) Hmmm.... (BTW, if any of the folks involved in Chandler are interested, I will spill my guts on both that project and 10 years more of private speculations for a remarkably small amount of microbrew or decent wine suitably applied.)
So which one works better for spam? Beats me. My own and my wife's private e-mail are on Earthlink, which uses Brightmail, and it leaks like a sieve. (Did I mention we have a portfolio company in this area? Heh-heh.) I'll be making the plunge into OS X Mail in a few days myself, so maybe I'll have an A/B comparison after a while. My suspicion is that either can work reasonably well, the biggest problem is tracking the moving target presented by the (expletive deleted) spammers, forcing continual retraining or modeling. |
Off Topic: Sumo Exports The PejmanPundit has some questions about the ceremonies surrounding sumo matches. Since I've noticed some .jp visitors wandering by here, would someone go over and help him out? I'm interested, too.
And while we're on the subject, if any of you have pull with the powers-that-be who own the TV franchise on the basho's, could you get them to send us poor gringos something better than the blipvert-speed, months out of date stuff we get now? With an English-language commentator who knows his stuff - I know they exist, I've watched them while killing time in Tokyo hotels. It could be a franchise - Americans are proven to like big guys crashing into each other. Some cultural imperialism, please! |
Gordon Mohr at O'Reilly describes a Sony device just being released in Japan: Handheld, 390 grams, WiFi enabled, with a 20GB embedded Linux powered file server. US price would be about $565 if released here at the same markup. Now considering that a quick check at Outpost shows me an SMC 802.11b AP for $50 after rebate, and $65 for a bare 20GB drive, that would seem to say that an MPU, a Linux kernel, and some integration and packaging are supposed to be worth $450. So what's the deal? It's the packaging that might count here. Sony is likely trying to do an Apple. After all, they get folks to happily cough up $500 for a 20GB drive packaged into an iPod (with no wireless), but with a brandname. Both Sony and Apple have that. And we all know what's going to be stored on that Sony gadget when it goes to a party, right? I'm sure they'll love it over at Sony Music.
Via Mochio Umeda |