[ mood | accomplished ]
Lately, I've been helping a colleague of mine with some data-mining research on the Enron email database (yeah, the one the DoJ subpoenaed from them). We're not doing anything the media will find terribly exciting -- I mean, I don't think we'll be uncovering any new conspiracies or anything like that -- but we're hoping to discover interesting and/or useful things about how large organisations use email as a tool. Anyway, one of the things my colleague wanted was a directed-graph representation of senders and recipients, where each edge represents email sent from the source vertex to the target vertex, and an edge's weight grows with the number of emails the sender has sent to that recipient.
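To make the edge-weight idea concrete, here is a minimal sketch of the counting step in plain C++. It is not Klein's code (Klein uses the BGL); the `Email` and `EdgeWeights` aliases and the `count_edges` function are names made up for illustration, and it assumes the emails have already been reduced to (sender, recipient) pairs.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One email, reduced to (sender, recipient). Hypothetical type for this sketch.
using Email = std::pair<std::string, std::string>;

// Directed edge weights, keyed by (source vertex, target vertex).
using EdgeWeights = std::map<Email, unsigned long>;

// Count how many emails went from each sender to each recipient.
// Each distinct (sender, recipient) pair becomes one directed edge,
// and its weight is the number of emails seen for that pair.
EdgeWeights count_edges(const std::vector<Email>& emails) {
    EdgeWeights weights;
    for (const auto& email : emails) {
        ++weights[email];  // operator[] value-initialises a new entry to 0
    }
    return weights;
}
```

Note that direction matters here: alice-to-bob and bob-to-alice are separate keys, so they end up as two separate edges with their own weights.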
"This is a job for Graphviz!" I said, and since I'd been meaning to sit down and learn how to use the Boost Graph Library anyway, I sat down and got to it. It's been a somewhat bumpy ride -- the BGL does some weird, weird things with templates, and its documentation seems to be written for people who, erm, already know what they're doing -- but I am now the proud author of my very own topologiser, Klein.
Klein creates maps of networks, mainly communication networks. It reads "source" and "target" data as tuples from a PostgreSQL database; it takes as command-line parameters the database name, source field name, target field name, table name (I'm already planning to handle joins in a future version, but not now), any necessary database connexion parameters, and (optionally) a minimum edge-weight. (The minimum edge-weight is there in case you run into the same problem we did, where the graph is so large that Graphviz can't handle the resulting .dot file and you just want to look at a higher-activity subgraph.) It outputs a file in DOT, the definition language which Graphviz uses. It requires libpqxx, which should come packaged with PostgreSQL and can also be found here, and libpopt, which is part of GNOME. It is pretty fast; analysing a graph of ~78,000 vertices and ~290,000 edges took about twenty minutes on a Pentium-4 M with 512MB of RAM, and that was with KDE running and reading the source data over 802.11b. (The resulting graph was way too big for Graphviz to handle, though.)
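For the curious, the output stage can be sketched roughly like this. Again, this is an illustration under assumptions rather than Klein's actual code: it skips libpqxx and the BGL entirely, assumes the edge weights are already sitting in a `std::map`, and the names `EdgeWeights`, `dot_quote`, `write_dot`, `to_dot_string` and the graph name `klein_sketch` are all invented for the example. The only Graphviz features it relies on are a `digraph` block, quoted node IDs, and the `weight` and `label` edge attributes.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Directed edge weights, keyed by (source, target). Hypothetical type for this sketch.
using EdgeWeights = std::map<std::pair<std::string, std::string>, unsigned long>;

// Wrap an identifier in double quotes for DOT, escaping embedded double quotes.
std::string dot_quote(const std::string& id) {
    std::string out = "\"";
    for (char c : id) {
        if (c == '"') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Write every edge whose weight is at least min_weight as a DOT digraph.
// Raising min_weight keeps only the higher-activity subgraph.
void write_dot(std::ostream& os, const EdgeWeights& weights,
               unsigned long min_weight = 1) {
    os << "digraph klein_sketch {\n";
    for (const auto& entry : weights) {
        if (entry.second < min_weight) continue;  // drop low-activity edges
        os << "  " << dot_quote(entry.first.first) << " -> "
           << dot_quote(entry.first.second)
           << " [weight=" << entry.second
           << ", label=" << entry.second << "];\n";
    }
    os << "}\n";
}

// Convenience wrapper that returns the DOT text as a string.
std::string to_dot_string(const EdgeWeights& weights, unsigned long min_weight = 1) {
    std::ostringstream os;
    write_dot(os, weights, min_weight);
    return os.str();
}
```

Filtering at output time, rather than while counting, is a deliberate choice in this sketch: an edge's final weight isn't known until all the tuples have been read, so the threshold can only be applied once counting is done.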
In the next few days, I'll throw GNU Autotools at it (which will be an adventure in and of itself) and package it up for release under the GPL. In later releases, I plan to include other data-source options, including reading from an ordinary istream; it would be terribly neat, I think, to figure out some way to read from an Ethereal or tcpdump session (preferably from a network adapter in promiscuous mode on a switch) in order to model network traffic density. I haven't thought very far ahead about other useful features, but if anyone has an "ooh, shiny" moment, I'll happily listen. :)
And, all that said, I'd be remiss if I didn't point out the role LJ has played in my evolution as a software developer. About two and a half years ago, other mentioned the existence of Graphviz, and the following semester, ernunnos pointed me at Boost; okay, maybe it took me two years to become a good enough C++ programmer to make use of it, but we all have to start somewhere. Thanks, guys. :)