Monday, February 10, 2003

Bayesian Nets, Latent Semantics, Despamming and other speculations

David Weinberger writes about the application of Bayesian networks to client-side spam detection software, and gets taken to task for putting OS X Mail on the list of products, when it actually uses latent semantic analysis. Kevin Marks knows that I played in this domain once upon a long time ago, and asks for a comparison and comment. Okay:

Bayesian networks, also known as belief networks, are indeed used in spam detection, including in server-side products such as Brightmail. (Disclosure: I have a dog in this fight.) A Bayes net is a graph structure whose nodes are terms or concepts with attached truth values or probabilities, and whose links are conditional probabilities. A conditional probability is a statement such as:

  • if word 'enlarge' then p(spam) is .8
  • if concept open-mailer then p(spam) is .5
  • if phrase 'business plan' then p(spam) is .05
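To make that concrete, here's a minimal sketch - my illustration, not Brightmail's code - of how statements like those combine. It's the degenerate, one-layer case of a Bayes net (naive Bayes, which treats each piece of evidence as independent); a full network with links among hypotheses needs real inference machinery:

  # Toy spam scorer: combine independent per-term evidence with Bayes' rule,
  # assuming a uniform prior. Probabilities are the ones from the list above.
  from math import prod  # Python 3.8+

  evidence = {
      "word 'enlarge'":         0.80,
      "concept open-mailer":    0.50,
      "phrase 'business plan'": 0.05,
  }

  def p_spam(probs):
      spam = prod(probs)                   # likelihood under 'spam'
      ham = prod(1.0 - p for p in probs)   # likelihood under 'not spam'
      return spam / (spam + ham)           # normalize to a posterior

  print(p_spam(list(evidence.values())))   # ~0.17: 'business plan' outvotes 'enlarge'

Note how the .05 on 'business plan' drags the whole posterior down. A real belief network also carries conditional links among the hypotheses themselves, which is where the consistency bookkeeping described next gets expensive.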
Bayesian networks were popularized in the AI community in the late 1980s by Judea Pearl's book (Pearl, BTW, is the father of Daniel Pearl, who copyedited it). Bayes networks were an advance at the time because they brought rigorous probability theory to the usually ad hoc hypothesis-scoring schemes then used in AI systems. The price is a lot of computation spent ensuring consistency among the various pieces of evidence and hypotheses. Good things about Bayesian networks:
  • They deal well with multiple sources of evidence and multiple, conflicting hypotheses.
  • A Bayesian network result can be 'explained' in terms that may be understandable to a human, since it uses explicit probability statements and descriptions of hypotheses and knowledge states. This can be very important in some settings, for instance medical systems (PDF).
  • When there is existing domain knowledge to be engineered into the system, Bayesian networks provide a good framework for capturing it due to their explicit nature.
And of course there are downsides to the technique:
  • A densely interconnected Bayesian network can be computationally challenging.
  • The algorithms will only work on acyclic networks, that is, no circular reasoning is allowed, even though some domains will naturally generate such rules. (There are ways to wiggle out of this, at a cost of more complexity.)
  • Since Bayesian nets are often hand-engineered, if the domain of interest is a moving target (spam, for instance), the models can get out of date and become expensive to maintain. There are ways to partially get around this by automatically generating networks from training sets.

Latent semantic analysis (or indexing) is an application of what's called principal components analysis (PCA), or factor analysis, to the domain of information organization. In the basic version, you form a big 2-D matrix with documents (e-mails, for instance) along one axis and terms (words, phrases) along the other, and fill in the entries with a 0 when the term doesn't occur in the document, and with a 1 (or a count) when it does. Then you take the resulting monstrous matrix and grind it up with an algorithm that finds covariance patterns. That is to say, it finds the associations of words 'latent' in the document base you feed in. Shovel in several weeks' worth of news stories and it's going to be obvious that 'Saddam' and 'Iraq' are highly correlated, or 'Tiger' and 'golf'. The method actually kicks out a transformation matrix into which you can feed the terms observed in a particular document, and get out a score for that document in terms of "warness" or "golfness" - those are principal components, or factors. You compute and save as many factors as you want - presumably fewer than the number of original terms. (Apologies to any wandering mathematicians for the gross simplifications.)
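Here's a toy run of that pipeline - my sketch with invented counts, not anyone's product code - using NumPy's SVD routine, the workhorse behind PCA:

  # Toy latent semantic analysis: term-by-document counts -> SVD -> k factors.
  import numpy as np

  terms = ["saddam", "iraq", "tiger", "golf"]
  # Rows are terms, columns are documents; entries are counts (all invented).
  A = np.array([[2, 3, 0, 0],   # saddam
                [3, 3, 0, 1],   # iraq
                [0, 0, 2, 2],   # tiger
                [0, 1, 3, 2]])  # golf

  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2                              # keep only the k largest factors
  Uk, sk = U[:, :k], s[:k]

  # 'Fold in' a new document: its factor scores are inv(diag(sk)) @ Uk.T @ doc.
  new_doc = np.array([1, 2, 0, 0])   # one 'saddam', two 'iraq', no golf terms
  print(Uk.T @ new_doc / sk)         # its score on each latent factor

  # Real corpora are huge and sparse; scipy.sparse.linalg.svds computes just
  # the top-k factors without ever building the dense matrix.

With these counts the first factor comes out war-flavored and the second golf-flavored (up to SVD's arbitrary signs), so the folded-in document scores mostly on the first. Document similarity then becomes distance - cosine, typically - in that k-dimensional factor space rather than raw word overlap, which is what buys you the 'latent' associations.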

Latent semantics was invented at Bellcore in the late 1980s, by an All-Star information science team including Tom Landauer, George Furnas, and Sue Dumais. (IP alert: It's patented.) There's a more recent description from Tom Landauer's university group here (PDF) and another overview under a Creative Commons license here.

The algorithm I hand-waved up above is called 'singular value decomposition' (SVD). Back in the early '90s, a team and I did a project called Guides (PDF) using the latent semantics technique. Since it was a small problem and we had access to free cycles from a Cray, we bludgeoned it to death with an off-the-shelf math library. For real-world use, though, you need a variant of SVD that takes advantage of the sparseness of the document matrix. One of the world's experts on sparse SVD is Dr. Michael Berry at UT Knoxville. Back then we were sending him some funding to port his algorithms so we could use them at Apple. And it looks like he's still banging away on reducing the computational costs ten years later. The good things about latent semantics with the SVD include:

  • It's a fully automatic algorithm; no hand-engineering of a knowledge model is required.
  • It computes a mathematically provable least-squares best-fit model to the data you feed in, which is unlikely with a hand-engineered or even automatically generated Bayesian network. Both techniques try to generate a reduced-dimensional space in which most of the variance of the original data is captured - SVD guarantees it's the best one.
  • Running the latent semantic model for retrieval is fairly cheap computationally, and increasing and decreasing the precision of the fit is very straightforward.
And the flip-side:
  • SVD is computationally intensive, and updating the solution to reflect new data as it's gathered still seems to be a research area.
  • It is actually rather difficult to engineer in prior hypotheses on how the latent semantic model should be structured, even if you have them.
  • The automatically generated 'factors' often don't lend themselves to use in explaining the results to humans. From experience, the first few factors are likely to be something you can explain in English, but go out to the 20th or so, and you'll find something like (figuratively) 'take two parts of Colin Powell, subtract off some Rumsfeld and a bit of Condi Rice, add in a dash of Tony Blair, that gives you the factor score'. What tha...?
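You can watch that start to happen even in a toy model. Reusing the made-up matrix from my earlier sketch, here's how you'd try to put an English gloss on a factor - rank the terms by the magnitude of their loadings:

  # Which terms load most heavily on each factor? (Same toy data as above.)
  import numpy as np

  terms = ["saddam", "iraq", "tiger", "golf"]
  A = np.array([[2, 3, 0, 0], [3, 3, 0, 1],
                [0, 0, 2, 2], [0, 1, 3, 2]])
  U, s, Vt = np.linalg.svd(A, full_matrices=False)

  for f in range(2):
      ranked = sorted(zip(terms, U[:, f]), key=lambda tw: -abs(tw[1]))
      print(f"factor {f}:", [(t, round(w, 2)) for t, w in ranked])

With only four terms, both factors still gloss cleanly - one war-ish, one golf-ish, the second already carrying negative war loadings. With tens of thousands of terms, the 20th factor's signed mixture is exactly the Powell-minus-Rumsfeld soup described above.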

So what's Apple doing with latent semantics to catch spam? Not sure. The simplest approach is to use a related factor analysis technique to find the best fit for predicting spam/not-spam in a training sample; it's not a full PCA, but I suppose you could call it latent semantics. It would be more interesting if they are using the full deal, maybe computing separate models for spam/not. Because, you see, latent semantics naturally lends itself to automatic sorting and organization of the document space over which the model was computed. And a few years back, the Apple group that I, and later Dan Rose, managed did an automatic e-mail organization project rather infelicitously called piles, which included a user interface for just such a thing. (IP alert: Some of it was patented.) Hmmm.... (BTW, if any of the folks involved in Chandler are interested, I will spill my guts on both that project and 10 years more of private speculations for a remarkably small amount of microbrew or decent wine suitably applied.)

So which one works better for spam? Beats me. My own and my wife's private e-mail are on Earthlink, which uses Brightmail, and it leaks like a sieve. (Did I mention we have a portfolio company in this area? Heh-heh.) I'll be taking the plunge into OS X Mail in a few days myself, so maybe I'll have an A/B comparison after a while. My suspicion is that either can work reasonably well; the biggest problem is tracking the moving target presented by the (expletive deleted) spammers, forcing continual retraining or remodeling.
7:35:31 PM    


Off Topic: Sumo Exports

The PejmanPundit has some questions about the ceremonies surrounding sumo matches. Since I've noticed some .jp visitors wandering by here, would someone go over and help him out? I'm interested, too.

And while we're on the subject, if any of you have pull with the powers-that-be who own the TV franchise on the bashos, could you get them to send us poor gringos something better than the blipvert-speed, months-out-of-date stuff we get now? With an English-language commentator who knows his stuff - I know they exist, I've watched them while killing time in Tokyo hotels. It could be a franchise - Americans are proven to like big guys crashing into each other. Some cultural imperialism, please!
12:09:49 PM    


File/MP3 server to go?

Gordon Mohr at O'Reilly describes a Sony device just being released in Japan: handheld, 390 grams, WiFi-enabled, with a 20GB, embedded-Linux-powered file server. US price would be about $565 if released here at the same markup.

Now, considering that a quick check at Outpost shows me an SMC 802.11b AP for $50 after rebate and a bare 20GB drive for $65, that would seem to say that an MPU, a Linux kernel, and some integration and packaging are supposed to be worth the remaining $450 ($565 - $50 - $65). So what's the deal?

It's the packaging that might count here. Sony is likely trying to do an Apple. After all, Apple gets folks to happily cough up $500 for a 20GB drive packaged into an iPod (with no wireless), on the strength of a brand name. Both Sony and Apple have that. And we all know what's going to be stored on that Sony gadget when it goes to a party, right? I'm sure they'll love it over at Sony Music.

Via Mochio Umeda
11:00:16 AM