Woke up yesterday at 4.30am. Decided to go to the gym. I got a new bike the other day off Craigslist, which I've been keeping inside to prevent theft. Took it downstairs, set off for the gym, got to the gym, realised I had left the bike lock behind. Always a danger when a bike is not being actively protected from theft by its lock between times of active use.
Back to the apartment. It was a glorious day. Why not go for a bike ride? But first, why not clear the kitchen counter so it would be clear when I got back? And why not make the bed and clear the bedroom floor of boxes so it would be clear when I got back? (It's only 5.30am, after all.)
And while we're at it, why not check e-mails?
Big mistake.
In my Inbox was an e-mail from Rafe Donahue with a link to the PDF of a 102-page handout, Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics.
Needless to say, instead of going for a bike ride first and reading Principles for Constructing Better Graphics later, I'm unable to refrain from downloading and opening the document.
Big mistake.
Before I know it I'm up to page 48 and laughing out loud. I'm deep in an account of work done for a pharmaceutical company which wanted to know how many on the sales force were actually reading various monthly reports sent out to them, with a view to seeing whether the reports were improving sales. I read:
The May data month data appear to have been released on the Tuesday after the 4th of July holiday break. With the 4th landing on a Sunday, the following Monday was an off day. The May data month data are pushed to 50% cumulative utilization in about a week as well, with no individual day have more than the typical 20% usage.
We might now glance down to the cumulative row and note the dramatic spikes when new data were released and the within-week declines and the dismal weekends, although Sunday does seem to trump Saturday.
We might also glance to the top and notice the curious case of the March data month data still being used by at least someone after the April data month data had been released, and after the May data month data, and so on. Someone was still using the March data month data in early October, after five updated versions of this report had been issued! Why?
Looking at the May data month, the eagle-eye might notice some curiosities in the May data month utilization before the May data were released. Note the very, very small tick on approximately 19 May, and again on (approximately) 14, 23, 24, and 30 April: the May data month data had been viewed nearly two months before they were released!?! Furthermore, now note with some horror that all of the cumulative lines are red and show some utilization prior to the reports actually being issued!
This glaring error must certainly point to faulty programming on the author’s part, right? On the contrary, an investigation into the utilization database, sorting and then displaying the raw data based on date of report usage, revealed that hidden within the hundreds of thousands of raw data records were some reports that were accessed prior to the Pilgrims arriving at Plymouth Rock in 1620! How could such a thing be so?
[As I think I've said, I've been trying to get the LRB to let me write about hysterical realism and information design. James Wood is agin the type of novelist who wants to show how the world works instead of the inner life - but it seems to me that Principles for Constructing Better Graphics not only shows us something about how the world works and how to find out about it, but in the process shows us the inner life of (surprise!) a statistician who is interested in the graphical presentation of data. We read on, agog:]
The answer is that I lied when I told you that the database held a record of the date and time each report was accessed. I told you that because that is what I had been told. The truth is that the database held a record of the date and time that was present on the field agent’s laptop each time the report was accessed.
[It's hard not to love this.]
Some of the field agents’ laptops, therefore, held incorrect date and time values. Viewing all the atomic-level data reveals this anomaly. Simply accounting the number of impulses in a certain time interval does nothing to reveal this issue with data.
But I know that it is not possible for those usages to take place before the reports were released, so why not just delete all the records that have dates prior to the release date, since they must be wrong? True, deleting all the records prior to the release would eliminate that problem, but such a solution is akin to turning up the car radio to mask the sounds of engine trouble. The relevant analysis question that has been exposed by actually looking at the data is why are these records pre-dated?
We see variation in the data; we need to understand these sources of variation. We have already exposed day-of-the-week, weekday/weekend, and data-month variation. What is the source of the early dates? Are the field agents resetting the calendars in their computers? Why would they do that? Is there a rogue group of field agents who are trying to game the system? Is there some reward they are perceiving for altering the calendar? Is there a time-zone issue going on? Are the reports released early very early in the morning on the east coast, when it is still the day before on the west coast? And, most vitally, how do we know that the data that appear to be correct, actually are correct???
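[For what it's worth, the idea of keeping the suspicious records and asking where they come from, rather than deleting them, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the records, agent IDs, tuple layout, year and function names below are all invented, and nothing here is taken from the report or its database.]

```python
from collections import Counter
from datetime import datetime

# Hypothetical usage records: (agent_id, data_month, timestamp from the laptop clock).
# All values are made up for illustration.
records = [
    ("agent_07", "May", datetime(2004, 7, 6, 9, 15)),
    ("agent_07", "May", datetime(2004, 4, 14, 8, 2)),
    ("agent_12", "May", datetime(2004, 7, 7, 14, 40)),
    ("agent_31", "May", datetime(1619, 11, 3, 10, 0)),
    ("agent_31", "May", datetime(2004, 5, 19, 16, 30)),
]

# Assumed release date per data month (the year is an arbitrary choice).
release = {"May": datetime(2004, 7, 6)}


def flag_predated(records, release):
    """Keep every record, adding a flag for timestamps earlier than the release."""
    return [
        (agent, month, ts, ts < release[month])
        for agent, month, ts in records
    ]


def predated_by_agent(flagged):
    """Count pre-dated records per agent, to see whose clocks look wrong."""
    return Counter(agent for agent, _, _, early in flagged if early)


flagged = flag_predated(records, release)
suspects = predated_by_agent(flagged)
for agent, n in sorted(suspects.items()):
    print(agent, n)
```

Nothing is thrown away: the flag just marks which rows to go and ask questions about, and the per-agent count is one possible first cut at the "who is doing this, and why?" question.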
I was, naturally, unable to tear myself away. By the time I'd finished reading, the weather had changed; it was cloudy and dull. Bad, bad, very bad.
Is it too late to change careers and be a statistician? Say it ain't so, Rafe, say it ain't so.
The full report here.