Showing posts with label information design. Show all posts

Thursday, August 2, 2012

ich wünschte ich wüßte... (I wish I knew...)

Cathy and Cosma both feel that knowing specific programming languages is not essential. To quote Cathy, "you shouldn’t obsess over something small like whether they already know SQL." To put it politely, I reject this statement. Applying for a data science job without learning the five key SQL statements is a fool's errand. Simply put, I'd never hire such a person. Coming to an interview and drawing a blank when asked to explain "left join" is a sign of (a) not being smart enough, (b) not wanting the job enough, or (c) not having done any data processing recently, or some combination of the above. If the candidate is a fresh college grad, I'd be sympathetic. If he or she has been in the industry, there will be no callback. (One detail not disclosed in the Cosma-Cathy dialogue is what level of hire they are talking about.)
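For the record, a "left join" keeps every row of the left table and fills in NULL where the right table has no match — the sort of thing a working analyst can demonstrate in thirty seconds. A minimal sketch using Python's built-in sqlite3 module (the tables and column names are invented for illustration):

```python
import sqlite3

# In-memory database with two tiny, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (1, 9.99);
""")

# LEFT JOIN keeps every user, even Bo, who has no matching order;
# his missing amount comes back as NULL (None in Python).
rows = conn.execute("""
    SELECT u.name, o.amount
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    ORDER BY u.id
""").fetchall()
print(rows)  # [('Ana', 9.99), ('Bo', None)]
```

An inner join on the same data would silently drop Bo — which is exactly the distinction a candidate should be able to articulate.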

Why do I insist that all (experienced) hires demonstrate a minimum competence in programming skills? It's not because I think smart people can't pick up SQL. The data science job is so much more than coding -- you need to learn the data structure, what the data mean, the business, the people, the processes, the systems, etc. You really don't want to spend your first few months sitting at your desk learning new programming languages.
Both Cathy and Cosma also agree that basic statistical concepts are easily taught or acquired. Many studies, starting with the Kahneman-Tversky work, have disproven this point...

Terrific post by Kaiser Fung (of Junk Charts and Numbers Rule Your World) - not least for the thrill of discovering that Cosma Shalizi is, er, aggressively discussing...

Wednesday, August 10, 2011

Andrew Gelman on the difference between information visualization and statistical graphics:

When I discuss the failings of Wordle (or of Nightingale’s spiral, or Kosara’s swirl, or this graph), it is not to put them down, but rather to highlight the gap between (a) what these visualizations do (draw attention to a data pattern and engage the viewer both visually and intellectually) and (b) my goal in statistical graphics (to display data patterns, both expected and unexpected). The differences between (a) and (b) are my subject, and a great way to highlight them is to consider examples that are effective as infovis but not as statistical graphics. I would have no problem with Kosara etc. doing the opposite with my favorite statistical graphics: demonstrating that despite their savvy graphical arrangements of comparisons, my graphs don’t always communicate what I’d like them to.

I’m very open to the idea that graphics experts could help me communicate in ways that I didn’t think of, just as I’d hope that graphics experts would accept that even the coolest images and dynamic graphics could be reimagined if the goal is data exploration.

To get back to our exchange with Kosara, I stand firm in my belief that the swirly plot is not such a good way to display time series data–there are more effective ways of understanding periodicity, and no I don’t think this has anything to do with dynamic vs. static graphics or problems with R. As I noted elsewhere, I think the very feature that makes many infographics appear beautiful is that they reveal the expected in an unexpected way, whereas statistical graphics are more about revealing the unexpected (or, as I would put it, checking the fit to data of models which may be explicitly or implicitly formulated). But I don’t want to debate that here. I’ll quarantine a discussion of the display of periodic data to another blog post.
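Gelman doesn't spell out the alternatives here, but one standard way to read periodicity out of a time series — instead of wrapping it around a circle — is to aggregate within the period, for instance averaging by day of week. A toy sketch in Python; the series and its values are entirely synthetic:

```python
from collections import defaultdict
from datetime import date, timedelta

# Synthetic daily series with a weekly cycle (all values invented):
# weekdays run higher than weekends.
start = date(2011, 1, 3)  # a Monday
series = []
for i in range(28):
    d = start + timedelta(days=i)
    series.append((d, 15 if d.weekday() < 5 else 10))

# Averaging within the period (here, by day of week) makes the
# periodicity legible at a glance, with no circular layout needed.
names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
by_day = defaultdict(list)
for d, v in series:
    by_day[names[d.weekday()]].append(v)
for day in names:
    print(day, sum(by_day[day]) / len(by_day[day]))
```

The same grouping drives a perfectly ordinary line or dot plot of weekday means — one plausible reading of "more effective ways of understanding periodicity."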

The whole thing here.

Thursday, July 15, 2010

I think some of the confusion that has arisen from Ed Tufte's work is that people read his book and then want to go make cool graphs of their own. But cool like Amis, not cool like Orwell. We each have our own styles, and I'm not trying to tell you what to do, just to help you look at your own writing and graphics so you can think harder about what you want your style to be.

Andrew Gelman at Statistical Modeling...

Thursday, May 13, 2010

fighting words


Yep, it has been scientifically proven: the accuracy of people in describing charts with 'chart junk' is no worse than for plain charts, and the recall after a 2-3 week gap was actually significantly better. In addition, people overwhelmingly preferred 'chart junk' diagrams for reading and remembering over plain charts. In all, the researchers conclude that if memorability is important, elaborate visual imagery has the potential to help fix a chart in a viewer's memory.


Infosthetics

Update: On Statistical Modeling and Causal Inference, Andrew Gelman draws attention to a couple of serious flaws with the argument: the plain graphs compared to the chartjunk graphs just aren't very good; chartjunk severely limits the amount of information that can be included in a graph. The whole thing here.

Monday, March 22, 2010


From Dataviz, via Revolutions.

Obama recently appointed ET to the Recovery Independent Advisory Panel. Excelsior.

Friday, July 18, 2008

...of the return

Woke up yesterday at 4.30am. Decided to go to the gym. I got a new bike the other day off Craigslist, which I've been keeping inside to prevent theft. Took it downstairs, set off for the gym, got to the gym, realised I had left the bike lock behind. Always a danger when a bike is not being actively protected from theft by its lock between times of active use.

Back to the apartment. It was a glorious day. Why not go for a bike ride? But first, why not clear the kitchen counter so it would be clear when I got back? And why not make the bed and clear the bedroom floor of boxes so it would be clear when I got back? (It's only 5.30am, after all.)

And while we're at it, why not check e-mails?

Big mistake.

In my Inbox was an e-mail from Rafe Donahue with a link to the PDF of a 102-page handout, Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics.

Needless to say, instead of going for a bike ride first and reading Principles for Constructing Better Graphics later, I'm unable to refrain from downloading and opening the document.

Big mistake.

Before I know it I'm up to page 48 and laughing out loud. I'm deep in an account of work done for a pharmaceutical company which wanted to know how many on the sales force were actually reading various monthly reports sent out to them, with a view to seeing whether the reports were improving sales. I read:



The May data month data appear to have been released on the Tuesday after the 4th of July holiday break. With the 4th landing on a Sunday, the following Monday was an off day. The May data month data are pushed to 50% cumulative utilization in about a week as well, with no individual day having more than the typical 20% usage.

We might now glance down to the cumulative row and note the dramatic spikes when new data were released and the within-week declines and the dismal weekends, although Sunday does seem to trump Saturday.

We might also glance to the top and notice the curious case of the March data month data still being used by at least someone after the April data month data had been released, and after the May data month data, and so on. Someone was still using the March data month data in early October, after five updated versions of this report had been issued! Why?

Looking at the May data month, the eagle-eye might notice some curiosities in the May data month utilization before the May data were released. Note the very, very small tick on approximately 19 May, and again on (approximately) 14, 23, 24, and 30 April: the May data month data had been viewed nearly two months before they were released!?! Furthermore, now note with some horror that all of the cumulative lines are red and show some utilization prior to the reports actually being issued!

This glaring error must certainly point to faulty programming on the author’s part, right? On the contrary, an investigation into the utilization database, sorting and then displaying the raw data based on date of report usage, revealed that hidden within the hundreds of thousands of raw data records were some reports that were accessed prior to the Pilgrims arriving at Plymouth Rock in 1620! How could such a thing be so?

[As I think I've said, I've been trying to get the LRB to let me write about hysterical realism and information design. James Wood is agin the type of novelist who wants to show how the world works instead of the inner life - but it seems to me that Principles for Constructing Better Graphics not only shows us something about how the world works and how to find out about it, but in the process shows us the inner life of (surprise!) a statistician who is interested in the graphical presentation of data. We read on, agog:]

The answer is that I lied when I told you that the database held a record of the date and time each report was accessed. I told you that because that is what I had been told. The truth is that the database held a record of the date and time that was present on the field agent’s laptop each time the report was accessed.

[It's hard not to love this.]

Some of the field agents’ laptops, therefore, held incorrect date and time values. Viewing all the atomic-level data reveals this anomaly. Simply counting the number of impulses in a certain time interval does nothing to reveal this issue with the data.

But I know that it is not possible for those usages to take place before the reports were released, so why not just delete all the records that have dates prior to the release date, since they must be wrong? True, deleting all the records prior to the release would eliminate that problem, but such a solution is akin to turning up the car radio to mask the sounds of engine trouble. The relevant analysis question that has been exposed by actually looking at the data is why are these records pre-dated?

We see variation in the data; we need to understand these sources of variation. We have already exposed day-of-the-week, weekday/weekend, and data-month variation. What is the source of the early dates? Are the field agents resetting the calendars in their computers? Why would they do that? Is there a rogue group of field agents who are trying to game the system? Is there some reward they are perceiving for altering the calendar? Is there a time-zone issue going on? Are the reports released very early in the morning on the east coast, when it is still the day before on the west coast? And, most vitally, how do we know that the data that appear to be correct actually are correct???
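Donahue's moral — that aggregate counts conceal what the atomic-level records reveal — is easy to illustrate. A minimal sketch in Python; the dates, release schedule, and record layout are all invented for illustration:

```python
from datetime import date

# Hypothetical release date for the May report (the Tuesday after
# the 2004 July 4th holiday, to echo the excerpt) and three raw
# atomic-level usage records: (report_month, access_date).
release = {"May": date(2004, 7, 6)}
records = [
    ("May", date(2004, 7, 6)),    # plausible access on release day
    ("May", date(2004, 5, 19)),   # accessed before release: bad laptop clock?
    ("May", date(1619, 4, 23)),   # absurd date, like the pre-1620 accesses
]

# An aggregate count per report hides the anomaly entirely:
print(len(records), "accesses — nothing looks wrong")

# Flagging individual records dated before release exposes it at once.
suspect = [(m, d) for m, d in records if d < release[m]]
for month, day in suspect:
    print(f"{month} report accessed {day}, before release on {release[month]}")
```

Deleting the flagged rows would "fix" the chart; as Donahue says, the real question is why those records are pre-dated at all.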




I was, naturally, unable to tear myself away. By the time I'd finished reading, the weather had changed: it was cloudy and dull. Bad, bad, very bad.

Is it too late to change careers and be a statistician? Say it ain't so, Rafe, say it ain't so.

The full report here.

Monday, March 24, 2008

from gridlock to green machines

At Daily EM (which got it from Core77), a poster from the Muenster planning department shows the amount of space required to transport the same number of passengers by car, bus, and bicycle.


(Bigger version here)

Tuesday, August 14, 2007

Jaegerman's graphics

Edward Tufte has a collection of news graphics by Megan Jaegerman here.