Am writing on my sister's laptop, as mine is on the fritz and I can't even find out whether it can be fixed because Maryland is under lockdown. Tried to leave a comment on my last post (ON MY OWN BLOG) and Google would not let me do it (despite the fact that I am, as you see, signed in as me and allowed to publish new posts).
It wasn't much of a comment, but anyway, Andrew! Such a great post!
It may be some time before I post again, as the Governor of Maryland is having trouble keeping order in class. He has now announced that if people don't cut down their visits to grocery stores he may keep us inside until September.
Thursday, April 9, 2020
Monday, April 6, 2020
Interview of Hadley Wickham (woot)
The real purpose of this blog, you may not have realised, is to have a place to put things where I can conveniently find them again. Am in the middle of an interview of Hadley Wickham by Will Chase, an interview in which HW (we are not worthy) says:
So I think for a long time there was this big pool of people that could potentially be contributing, but they were really put off by R-help (note: R-help is a notoriously hostile mailing list and was the only way to get help with R in the early days). And then the timing was lucky enough that there were two significant changes that allowed the community to reinvent itself to some degree.
The first of those was StackOverflow. It seems hard to imagine now, but at the time, StackOverflow was so incredibly welcoming and friendly. And I think part of that was that in contrast to R-help, anything would seem welcoming and friendly...
This made me laugh, because I had spent countless hours trawling through installments of the R-help mailing list, and the principal contributors of answers (Brian Ripley, Uwe Ligges, Duncan Murdoch, Peter Dalgaard, others I could once have named without thinking) were often very severe. But after one had trawled through HUNDREDS of installments one couldn't help but be struck by the generosity of contributors who kept answering question after question for months, years on end. To this day I feel an affection for Ripley, Ligges and all (the mere name Uwe Ligges has only to come to mind to make me smile), an affection yet to be inspired by professional contacts who are EXTREMELY friendly and dodge questions like so many bullets.
The whole thing here (this is, of course, the link I want in a convenient place).
Monday, May 30, 2016
Sunday, July 26, 2015
The Hadleyverse
Most of my time, these days, is taken up with dealing with the * it would be professional suicide to report.
Not so sure discretion is the better part of valor, but there are wheels within wheels within wheels.
Meanwhile I am tormented by an alternative universe. The Hadleyverse.
This is probably a case where ignorance is bliss. If you don't know about the Hadleyverse, you are probably better off not knowing. If you're not already programming in R, if you're not already getting the hang of ggplot2, if you haven't installed RStudio and been repeatedly dazzled by all the amazing things you now can do that were a headache in basic R -- if you haven't THEN been seduced by Shiny, if you've never grappled with data and then discovered that Mr Fabulous has released a package or 3 that addresses most-to-all of the problems that were driving you berserk--
If you've never followed this path you probably DON'T want to know more about Hadley Wickham. You're mired down in a professional environment where solving problems is nobody's business. It may well be driving you crazy, but at least you don't have an example of a world where things can be different.
So this is bad, bad, bad, very bad, but there's a terrific profile of Hadley Wickham on Priceonomics, which I am now sharing with people who might well be happier if they knew nothing about it. The whole thing here.
Wednesday, November 26, 2014
not enough letters
The general problem of lack of analytic talent, and of employers as well as employees wasting tons of time in fruitless job interviews, is well illustrated when you compare the resumes and job ads below. One of our readers (look at the comments below) mentioned that the skill R (one of the two most popular programming languages used by data scientists, the other one being Python) is never picked up by the automated search tools used by recruiters to parse resumes, because it's just one letter. So it does not matter whether or not you have R in your resume, if the hiring company uses poor automated filtering tools to narrow down on candidates with desired skills, such as (especially and ironically) R.
Vincent Granville, Why Companies Can't Find Analytic Talent
Thursday, August 11, 2011
useless but cool (ars gratia artis)
Thursday, August 12, 2010
Sapir-Whorf and ggplot2
To some degree, we are constrained in our ability to solve problems if we only know a single language. This situation has been recognized in different ways by the programming community. The Logo programming language was built upon constructionist learning theory and was intended to provide a "mental model" for children to come to understand mathematical constructs. In recent times, many programmers have committed to being polyglots, learning new languages as part of professional development. Their concern is not always to learn the latest language they will need for work, but to find new ways of conceptualizing problems and structuring solutions.

R Chart on language and thought and ggplot2
ESPN's Bill Simmons (aka The Sports Guy) recently suggested that the primary cause of dwindling interest in Red Sox games is that baseball games these days are too long. "It's not that fun to spend 30-45 minutes driving to a game, paying for parking, parking, waiting in line to get in, finding your seat ... and then spend the next three-plus hours watching people play baseball," he says. Revolutions (news about R &c) offers a plot in ggplot2 to determine, anyway, whether the data support the claim that games are getting longer.
Erm, I always thought the reason I thought baseball games were too long was that I was not interested in baseball. Had not considered the possibility that a 3-hour game might put off people who actually liked the game.
Thursday, November 19, 2009
Tuesday, July 7, 2009
head to head
Over on Learning R, the intrepid RLearner is going through Deepayan Sarkar's book on data visualization using Lattice and replicating the graphics using Hadley Wickham's ggplot2. It's completely enchanting.
Tuesday, March 24, 2009
survival analysis using R
Survival Analysis
A great many studies in statistics deal with deaths or with failures of components: they involve the numbers of deaths, the timing of death, or the risks of death to which different classes of individuals are exposed. The analysis of survival data is a major focus of the statistics business (see Kalbfleisch and Prentice, 1980; Miller, 1981; Fleming and Harrington, 1991), for which R supports a wide range of tools. The main theme of this chapter is the analysis of data that take the form of measurements of the time to death, or the time to failure of a component. Up to now, we have dealt with mortality data by considering the proportion of individuals that were dead at a given time. In this chapter each individual is followed until it dies, then the time of death is recorded (this will be the response variable). Individuals that survive to the end of the experiment will die at an unknown time in the future; they are said to be censored (as explained below).
from Michael Crawley's The R Book (currently available in hardback on Amazon for a mere $88, 20% off the cover price of $110)
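The analyses Crawley's chapter describes are typically done with the survival package that ships with R. A minimal sketch of a Kaplan-Meier estimate, with made-up times purely for illustration (nothing here is from the book's own examples):

```r
library(survival)  # one of R's recommended packages

# Each individual is followed until death or until the study ends;
# status = 1 means death observed, 0 means censored at that time.
time   <- c(5, 8, 12, 15, 20, 20, 24, 30)
status <- c(1, 1, 1,  1,  0,  1,  0,  1)

# Kaplan-Meier estimate of the survival curve for the whole sample
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)
plot(fit, xlab = "Time", ylab = "Proportion surviving")
```

The censored individuals are not thrown away: they contribute to the risk set up to the time they were last seen, which is the whole point of the machinery.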
wie es eigentlich ist
If you measure the same thing twice you will get two different answers. If you measure the same thing on different occasions you will get different answers because the thing will have aged. If you measure different individuals, they will differ for both genetic and environmental reasons (nature and nurture). Heterogeneity is universal: spatial heterogeneity means that places always differ and temporal heterogeneity means that times always differ.
Because everything varies, finding that things vary is simply not interesting. We need a way of discriminating between variation that is scientifically interesting, and variation that just reflects background heterogeneity. That is why we need statistics. It is what this whole book is about.
The key concept is the amount of variation that we would expect to occur by chance alone, when nothing scientifically interesting was going on...
....when nothing really is going on, then we want to know this. It makes life much simpler if we can be reasonably sure that there is no relationship between y and x. Some students think that 'the only good result is a significant result'. They feel that their study has somehow failed if it shows that 'A has no significant effect on B'. This is an understandable failing of human nature, but it is not good science. The point is that we want to know the truth one way or the other.
from Statistics: An Introduction using R, by Michael J. Crawley, a book which it is already impossible not to love.
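A tiny R sketch of Crawley's point: when y really is generated with no relationship to x, the analysis should say so, and that is a perfectly good result. The data here are simulated purely for illustration:

```r
set.seed(1)
x <- rnorm(50)
y <- rnorm(50)   # y is generated independently of x: nothing is going on

fit <- lm(y ~ x)
# The slope estimate should be near zero, with a large p-value;
# 'A has no significant effect on B' is the correct finding here.
summary(fit)$coefficients
```

Knowing this is what a null relationship looks like is half the battle when you meet real data.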
Sunday, March 15, 2009
Looking at Data course
Courtesy of Flowing Data:
Hadley Wickham (creator of the R package ggplot2) will be giving a 2-day course in Washington, DC, Looking at Data, on 30 and 31 July. Day 1: Static graphics; Day 2: Dynamic and interactive graphics. $295 for one day, $550 for both, $100 per day for students. Full details here.
Morrrrrrrrrre R
Kevin Connolly, author of the terrific Excel plot on language learning, has just sent me a link to a MetaFilter page, Rrrrrgh, with many links previously unknown to me. There's Kickstarting R from Jim Lemon on the R site:
Kickstarting R was initially compiled to help new users by requesting accounts of "... things that drove you crazy the first time you used R"....
Concluding that most of the introductions for beginners cover this well,
I decided to concentrate on providing a few solutions for tasks that new users would be likely to face. The reader I have in mind is one who has just installed R and is asked to produce the usual listing of descriptive stats and plots from a data file that is in an arbitrary format. A simple job, if you already know how to use R. I hope that this revision will be even better at helping new users get started.
So for readers who have some data and have been wondering whether they could use R to play around with it, this could be just the ticket.
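For concreteness, that first job -- read a data file, get descriptive stats and a plot -- really is only a few lines of R. (The file here is a made-up example, written out first just so the sketch is self-contained; with real data you would point read.csv at your own file.)

```r
# Write a tiny made-up CSV so the example runs on its own
csv <- tempfile(fileext = ".csv")
writeLines(c("score,group", "12,a", "15,a", "9,b", "14,b"), csv)

dat <- read.csv(csv)   # read.table(..., sep = "\t") etc. for other formats

summary(dat$score)     # the usual descriptive stats
hist(dat$score, main = "Scores")
```

Which is, of course, exactly the sort of thing that looks trivial once you know it and impenetrable before.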
There's Revolutions (a blog with "news about R, statistics, and the world of open source from the staff of REvolution Computing"). There's the blog OneRTipaDay. And many many more. (Also a link to RSeek.org, a customized version of Google for those who want to search for R-related sites; typing "R" into the vanilla Google search field produces, as you can probably imagine, mixed results.)
Checking out OneRTipaDay, I discover that the NYT actually ran a piece on R back in January (how did I miss this?) - a piece with photos of Robert Gentleman and Ross Ihaka, who released the first version of R back in 1996. The whole thing here.
Thanks, Kevin!
Sunday, March 8, 2009
inforr
I decided to ride my bike to the gym the other day, but it wasn't locked up outside. 'Oh,' I thought, 'it must have been stolen.' It was only at this point that I realised how much I had always disliked it, and what a relief it was to have it taken off my hands.
Walked to the gym. Half an hour or so later I suddenly seemed to remember riding the bike to Cafe Kleisther days? weeks? earlier and locking it up outside Kleistpark U-Bahn. Went down to Kleistpark after the gym. Sure enough the bike was there, with an empty Burger King bag in its basket. Went to Kleisther for a coffee and forgot the bike again. Now that I've remembered it again I might go and pick it up.
It's not really safe to ride it if I have books in my head and am also dealing with a lot of Cheshire cats (normal publishing people). If I have to think through a book, and if people make nice noises and I am suddenly left with a lot of grins hanging in the air, there isn't enough mental capacity left to stay out of traffic accidents. So then I just end up walking the bike for miles, or leaving it at home.
I woke up this morning and thought how good it would be to stay in bed; how many bad things could have been avoided if I had taken the simple precaution of staying in bed. How many e-mails had better remained unwritten, unsent; &c. I had spent much of yesterday, though, installing MS Office 2007, R 2.8 and a couple of packages on my netbook, as well as activating my one year's free subscription to Inference for R; I had also downloaded all the documentation onto my Mac so I could read it on a separate screen while manoeuvring around the tiny screen of the netbook. So if I got out of bed I could try out Inference for R.
Got up. The netbook is, needless to say, not my weapon of choice for playing around with a plug-in for Word and Excel that lets the happy user incorporate R code into Office documents, but the old Sony Vaio would not install Office 2007 without some extra Windows Service Packs that I had no way of downloading, 2C2E. I set up an array of laptops on the desk and start working through the QuickStart documentation.
Well, what can I say? I worked through various examples in the uniformly excellent documentation; sure enough, various Word and Excel files obligingly did nice things with R. I then reached a point at which an example incorporated a plot in Lattice: one followed instructions, and there it was, a beautiful Lattice plot right in the middle of a Word document! Exactly what I had always wanted!
Exactly what I needed for my Guggenheim project!
I shall probably have more to say about this when I've had time to work through some of my own examples; can only comment at this point that it is really not very common to start out on new software with the feeling that one can get straight to work on a project whose problems it solves, without putting in who knows how many hours getting the thing to work.
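For anyone who hasn't seen Lattice, the kind of plot that turned up in the middle of the Word document can be sketched in a few lines. (This is not the Inference for R workflow itself, just plain Lattice; the dataset, mtcars, is R's built-in example data, not anything from their documentation.)

```r
library(lattice)  # ships with R

# A small multi-panel plot: fuel economy against weight,
# one panel per cylinder count
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p)  # lattice objects must be printed explicitly to render in a script
```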
Thursday, March 5, 2009
meanwhile
I get an email from Ben Hinchliffe asking if I would be interested in a year's free trial of Inference for R, a plug-in which allows you to haul whatever it is you've been doing in R into Microsoft Office.
Fresh from the Ed Park meltdown (2C2E) I've been thinking that what I should do is go back to the States to live with my mother and get a job at Waffle House: 1. I love Waffle House; 2. There are 4 Waffle Houses in Frederick, Maryland, not far from my mother's home in Leisureworld; 3. Leisureworld always conjures up a combination of the Stepford Wives and Westworld, a film in which Yul Brynner runs amok as a robot cowboy; but 4. There are 4 Waffle Houses within striking distance, how bad can it be? 5. TAR ART RAT says his grandfather lives in Leisureworld and is having a rare old time.
I then read an article in the New Yorker about DFW's last book, a book about someone who works at the IRS. For DFW, this was the ultimate in boredom; he was trying to grapple with the concept of a life in which grace was found in the heart of boredom. I want to argue with him: DFW, you just don't get it. You don't know what it's like to have a quote-unquote job where the work you put in writing a text is nothing, the thing that pays the bills is the work you add to this nichts-an-sich by charming and waiting and charming and waiting and charming. What would that be like, to turn up and put in 8 hours' work and get paid for the 8 hours' work, instead of putting in 3000 hours' work and not getting paid because you could only add a paltry month of charming and waiting and charming instead of the three years that were industry standard? (I know what it's like, obviously; this is what makes Waffle House look so good.) I go to the IRS website to check out employment opportunities; that sounds so good, that sounds so good.
But all this time Inference for R has been lurking at the back of my mind. According to Ben Hinchliffe, it works on Windows, Microsoft Office 2003 Professional, Microsoft Office 2007 all versions. I only have the Home/Student edition of Office (for PC and Mac); I normally use a Mac; but if I pick up a copy of Office 2007 I can try this out! I remember seeing a cheap copy at Saturn; I take the 204 bus to Kurfurstendamm, pick up Office 2007, head home, stop off at this café.
The battery in my laptop is about to die.
Last seen wandering vaguely
Quite of her own accord
Thursday, October 23, 2008
intercostal clavicle
My mother is visiting Berlin. I've been preparing my apartment to sublet. Much running about. The second day of my mother's visit I was putting coal in the stove when the doorbell rang. It was the Post. A Paket. I sign for it, tear it open - it's my intercostal clavicle! Well, OK, not really. It's Deepayan Sarkar's Lattice: Multivariate Data Visualization with R! I did not have the nerve to try to blag a review copy off Springer, given my bad habit of putting reviews in the drafts folder, so I ordered one online and now it is here. Have been getting my mother up to speed with e-mail and downloading new packages for R. Was reading Andrew Gelman's splendid Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do before I went to Oxford and then got caught up in preparations to sublet / see mother and suddenly realised I was unlikely to write about it before the election, which is appalling but can't be helped, so I point readers shamefacedly to AG's blog...
Sunday, August 3, 2008
more R
Have just found a new online introduction to R, Quick-R, which looks terrific. I have stacks of print-outs of R documentation littering the apartment, but there's something demoralising about leafing through a PDF - not least because, with the passage of time, the most frequently consulted pages tend to go missing. (I'm constantly coming across stray pages of R documentation while looking for keys, credit card, gym card, passport...) I do also have a couple of books, whose pages don't go missing - but if one wants to use any code one either has to type it out by hand or go to the relevant website. So Robert I. Kabacoff has done all beginners and improvers a big favour.
Thanks to Kabacoff I learn that Deepayan Sarkar's Lattice: Multivariate Data Visualization with R has just been published by Springer. Also great news.
Monday, May 19, 2008
Best excuse ever
My dear friend Rafe Donahue has sent many helpful suggestions with regard to ggplot2. He comments:
Don't think of the ± of 0.2 to each of the data at the repeat locations as modifying the data. You are not modifying the data, you are providing a different plotting algorithm for those points that share a location with other points.
a line of argument which it is hard not to love. I'm not saying it's not valid, no... and yet I see myself, down the years, explaining innocently that I was not actually modifying the data as such, I was just providing a different plotting algorithm etc. etc.
Meanwhile Hadley Wickham very kindly sent the correct line of code to change the y axis. I type this in, and by the simple procedure of providing a different plotting algorithm for those points that share a location with other points produce
which really is terribly nice. Further information on ggplot2 is available here.
Readers who have not spent much time with Excel charts may be inclined to accept uncritically the complaints of PP; a wealth of information on what can be achieved is available at Peltier Technical Services, here.
On the subject of providing a different plotting algorithm for data whose points overlap, Rafe reminds me that
As I said before, we did this in the baseball plot data by putting them in little boxes. Of course, you need to decide on the size of the box, etc., but that is the price to pay.
For those who missed the great bivariate baseball score plot the first time round, this enables the user to select a team and one or more aspects of its game history and generate a bivariate plot, for example
You can create your own plots here.
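For the record, base R ships a plot type very much in the spirit of the little boxes: `sunflowerplot()` draws one petal per observation at each location, so repeated points are counted rather than hidden under a single dot. A minimal sketch, using the test scores from the post below:

```r
# Test scores from the "losing the plot" post
test1 <- c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19)
test2 <- c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)

# One petal per observation at each location: the two duplicated points
# become two-petal "sunflowers" instead of indistinguishable single dots
sunflowerplot(test1, test2, xlim = c(0, 20), ylim = c(0, 20),
              xlab = "Test 1", ylab = "Test 2")
```

Here too nothing is added to the data; the multiplicity at each location is computed and displayed, which is the boxes idea with the box size chosen for you.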
Sunday, May 18, 2008
losing the plot
In the previous post I mentioned the results of a couple of tests on memorisation of articles in a language class. A reader pointed out that the plots given separately could be combined in a bivariate plot. Too true. This was weighing on my mind even as I posted the simple pair of plots in the last post.
The results, I remind you, were:
Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19
Since the data are currently in Excel, I run them through the chart wizard to generate a scatter plot and come up with:
which is not at ALL what I want. Why does the y axis start at 16.5? Why is it broken down in increments of .5, when the number of correct answers was always an integer? I want a chart that shows the area that's blank because NOBODY got a second score below 17.
The Excel Chart Wizard does not offer the option of customising the y axis. Since I always expect the worst of Excel, I assume there is nothing to be done. In my hour of shame, I come up with the dodgy solution of, ahem, adding a dummy set of results at 0,0. This produces
which is an improvement, but I am, needless to say, deeply mortified by the false result at the lower left-hand corner. I suddenly think: But what if I double-click on the y axis! Sure enough this brings up a dialogue box which lets me set the y axis to my own specifications, and...
Ha!
There's just one slight problem. I know this tiny database, which means I know that 9 people got 20 on the second test, whereas the line at 20 shows only 7 results. The chart has fallen victim to overplotting; the two people who got 6 on Test 1 and 20 on Test 2 have been collapsed into a single dot, as have the two who got 15 followed by 20. Excel has come through once, but I can't believe there's a way to jitter the plot points. I retreat cravenly to inserting bullet points by hand in the basic grid:
This is clunky, no doubt about it, but since I've been doing it all by hand it's easy to see the two people at 6 who got 20 and the two at 15 who got 20:
I then realise that I can achieve a similar result in the charts feature by tampering, yet again, with the data: if I replace the pairs (6,20; 6,20) with (5.8,20; 6.1,20) and use the same dodge on 15 I come up with
Good. Good. (I mention all this because Excel is what most readers are likely to have in the home; it's easy to assume that feeding data into a chart will generate a chart that displays all the data.)
At this point, needless to say, I do not feel happy about a chart that depends on fudging the data. I now do what I should have done in the first place, which is to take it all into R. How much better it would all look, I think, if I used Hadley Wickham's ggplot2 package!
So I put the data into R. Vanilla R produces a plot which throws up a y axis that starts at 17 and moves by increments of .5 to 20, which means it is necessary to rifle through much PDF documentation (which is, of course, why I did not take this very simple task to R in the first place). ylim produces the right axis but doesn't look very nice, so I load ggplot2 and get this
which is very pretty but has yet another y axis starting at 17 and going up to 20 in increments of .5. There passed a weary time, each tongue was parched and glazed each eye, in other words ylim does not do the business in ggplot2, some other method of tinkering is called for, I spend much time rifling through the documentation of ggplot2 both in PDF and at geom_point
(I knew this would happen) trying to work out what to do. Wickham's work is inspired not only by Tufte but by Leland Wilkinson's The Grammar of Graphics, which means that the documentation discusses the rationale underlying the package, which is, of course, both interesting and admirable but unhelpful if you just want to know how to do in ggplot2 what ylim does in vanilla R. Finger in the page. geom_point does make it easy to jitter, so I try that out and get
which is actually not what I want at all, because I only want to jitter the four points where there is overlapping. I think there is a way to fix this (I think it is possible to select horizontal jitter), but how late it is, how late.
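For anyone retracing this in a current version of ggplot2 — and the API has changed a great deal since 2008, so this is a sketch against today's interface, not the one described above — both problems now have one-argument answers: `position_jitter()` takes separate `width` and `height`, and the scale functions take `limits`.

```r
library(ggplot2)

test1 <- c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19)
test2 <- c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)

p <- ggplot(data.frame(test1, test2), aes(test1, test2)) +
  # width sets the horizontal jitter; height = 0 keeps every score on its
  # true integer line vertically
  geom_point(position = position_jitter(width = 0.2, height = 0)) +
  # run both axes over the full range of possible scores
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(limits = c(0, 20))
p
```

Note that this still jitters every point a little, not just the four overlapping ones; restricting the nudge to co-located points would mean computing the offsets by hand before plotting.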
At this point, naturally, I begin to wonder whether it is not somewhat infra dig to put all this low-level milling about on display; how much better just to relegate it all to the drafts folder! Wait till I have worked through ggplot2 properly and at some later date post a series of handsome plots, drawing on a more interesting range of data sets, with an air of effortless ease. Yes.
(I revert to my paltry little Excel chart. Wouldn't it be better to have gridlines that divided the area in four? Would it be better if the numbers on the x axis were closer to the points, i.e. at the top of the plot?
Well, maybe. It's clear that about half the participants got under half the answers right on Test 1, and everyone got better than 75% right on Test 2, so that's quite nice. And it does look somewhat like a Smeg refrigerator into the bargain.)
One problem with writing novels is that you often find that there is some software somewhere that looks as though it might do some specific thing that you need for some particular chapter, which may well never be needed again. So you find yourself simultaneously at the embarrassing amateur stage of, who knows, maybe 10 or 15 different programs. So what you would really love to have is the literary equivalent of a director of photography - a technical advisor whose job it is to answer questions like 'How do I fix the axes in ggplot2?' But this is really at odds with the whole Weltanschauung of the publishing industry. But enough, enough.
I then think, but maybe it would be nice to see the two sets of data in a line plot. I am somewhat demoralised by my adventures with ggplot2, so I run them through Excel and get
which is, of course, hideous.
But also enlightening.
Participants got 3 minutes to learn the genders of 20 words. Pre-technique, half remembered fewer than half. Post-technique, half remembered 100%; all remembered 80% or better. ONE PERSON, who started with a score of 19, failed to raise the score. In a word unknown to the immortal bard, blimey.
I don't know how well they would have performed if they had been tested again after half an hour, or 5 hours, or 5 days; this is, one would have thought, an obvious question, but it was one that was not answered in the class.
Meanwhile, behind the scenes... I draft an e-mail to Hadley Wickham, pleading for help. I then realise that my dear dear friend Rafe Donahue, despite his exasperation with the sort of person who is seduced by pretty plots, is still my dear dear dear dear friend. I send an e-mail to my dear friend...
And meanwhile, what to my wondering eyes should appear, but a newsletter from Linotype celebrating the birthday of Adrian Frutiger. I mooch around the Linotype website, checking out the Akira Says column: the most recent essay is on Frutiger, but there are also essays on dashes (hyphens, en-dashes, em-dashes), small caps...
And I am OUTRAGED
because the thing is, when you see a book into print, an 'expert' will be given a month or so to go through the text to introduce 'correct' dashes, capitalisation and so on, which the author can then spend up to 6 months trying to remove
but the thing is, let's be sane. Fine-tuning the dashes and caps is never going to achieve significant improvement in the reader's grasp and retention of the text. When I say 'significant' I'm not poaching on statistical preserves, I'm talking about the kind of improvement displayed in a pair of tests on memorisation of gender. Text A gets 50% of readers, we'll say, getting things wrong 50% of the time; improved Text B gets 100% of readers getting things right 80% of the time or better. You're not going to see that, because, um, text has an extremely limited capacity to convey information in the first place. Whereas, of course, if you start with two sets of numbers
Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19
and convert them to some kind of graphic display (as above), you can dramatically improve your chances of conveying a pattern of change. And if the graphic display has the allure of a Smeg, it will dramatically improve the chances that the sort of reader who has hitherto loathed graphs will suddenly be downloading R, braving PDF documentation, collecting data on self, friends and relations for the sheer entertainment of turning it all into graphs.
The point being, if publishers hired statisticians instead of copy-editors and designers, so that the author spent a few months going over the text with a statistical expert instead of the sort of person who knows his en-dashes, it would still be a lot of work, but it would be worth it.
Meanwhile it's a dark, gloomy day.