paperpools: google


Friday, April 21, 2017

Google Books reviewed

After the settlement failed, Clancy told me that at Google “there was just this air let out of the balloon.” Despite eventually winning Authors Guild v. Google, and having the courts declare that displaying snippets of copyrighted books was fair use, the company all but shut down its scanning operation.

It was strange to me, the idea that somewhere at Google there is a database containing 25-million books and nobody is allowed to read them. It’s like that scene at the end of the first Indiana Jones movie where they put the Ark of the Covenant back on a shelf somewhere, lost in the chaos of a vast warehouse. It’s there. The books are there. People have been trying to build a library like this for ages—to do so, they’ve said, would be to erect one of the great humanitarian artifacts of all time—and here we’ve done the work to make it real and we were about to give it to the world and now, instead, it’s 50 or 60 petabytes on disk, and the only people who can see it are half a dozen engineers on the project who happen to have access because they’re the ones responsible for locking it up.

James Somers at the Atlantic on Google Books, the whole thing here.

Wednesday, October 31, 2007

This morning I did not even bother to clock into Fogbugz because, I forget, anyway it suddenly occurred to me that I had not checked out the Juice Analytics blog for a long time. So I checked out the blog and the most recent entry was a set of tips for a maps mashup, with a link to another maps mashup, both of which I have passed on to Hassan who is jobhunting so probably has no time (but looking at the tips I thought I might be able to tackle this myself if the world of gainful employment claims HA). In the sidebar was a link to Michael Lavine's free e-book Introduction to Statistical Thought with examples using R (it has been out since December 2006, so I'm a bit late, but it looks great and this is really not too bad). There was also a link to a video, Inbox Zero, a talk given by Merlin Mann to Google. I do think I need to do something about Inbox management, but this has to be the worst advice I have ever seen - bad advice for anyone with an Inbox, and exceptionally stupid given the audience (who are, at the risk of stating the obvious, exceptionally smart people). I give you the link so you can judge for yourselves if you want to:

Inbox Zero

Mann's advice amounts to this: you should try to check e-mail less often (turn off Autocheck if you have it (I don't), try to keep checking down to once an hour), and act on all incoming e-mail as soon as you check it, with one of the following responses:

Delete (includes Archive)
Delegate
Defer
Respond (preferably with a maximum of 5 sentences)
Do (take an action that solves the problem)

All these responses, as I say, should be done immediately so that the Inbox is not used as a To Do list.

Now, Mann does not talk about the problem of a Drafts folder whose contents are up in the triple digits, and you might think someone whose Drafts folder is overflowing is even worse than someone with an Inbox that's overflowing. That's not really true.

As any fule kno, Google has a policy of allowing its very smart staff to spend 20% of their time on a personal project. What this means, of course, is that each member of staff is spending 20% of his or her time on a cool project which he or she hopes ONE day to be spending 100% of his or her time on - a cool project that will take off. But if you work on something new you often have to get advice or information from other people - and each time you have some new brilliant idea the temptation is to fire off an e-mail to someone who might have the answer.

This is ordinary practice even for someone like me, who works independently. You surf around online, you discover the existence of someone you don't know from a bar of soap who would naturally be only too THRILLED to help, dash off a quick e-mail or maybe just capture the e-mail address for future use... If you're at Google, though, this element of working on a project is also (I assume) a way of persuading other people that the project is cool, getting other people excited about it so that it stands a better chance of being adopted by Google. Well, obviously, if EVERYONE at Google is spending 20% of their time on a pet project they think incredibly cool, the potential for everyone to be buried under an avalanche of e-mails is very high - and that's before you take into account the 80% of time that's spent on Google-approved projects.

What this means is that one very good way to keep the general volume of e-mail down is for people to be much more ruthless, not with the e-mails that come in, but with those that go out. You have a brilliant idea, you want quick answers, you dash off an e-mail - and put it in the Drafts folder. Sometimes the answer to the question turns up in a few days or a week. Sometimes you go back to the e-mail which is full of last week's brilliant idea and meanwhile you have moved on to another brilliant idea. If you have a lot of these ideas you will end up with a Drafts folder with 177 e-mails, but you have kept them out of someone else's Inbox. You can be selective about the people you do write to, you can think through your questions properly, you can provide whatever information is necessary so people can answer once rather than engage in e-mail ping pong.

Mann thought one way to keep volume down was to write brief replies. Someone sends me a 25-paragraph e-mail, I can't write a 25-paragraph reply... This is actually silly. If a point or question can be made briefly then of course there is no virtue in length. If an issue is complex, though, if several options must be considered, each with different implications, it is more helpful to have everything set out in a single message. Each paragraph should not need a paragraph in reply; a good reply will respond point by point, with perhaps a sentence or two per paragraph, interpolated into the original text. You then have all relevant information and responses in a single document, which is much easier to consult if you have to go back to it in 6 months than a series of e-mails with the same title prefaced by Re: Re: Re: Re: Re:

There are people who can't cope with anything more complicated than an exam with T/F and multiple choice questions. It's very hard to do business with that kind of person, because you have to boil everything down to the type of question that has simple answers. Things go horribly wrong because you can't discuss problems at the level of complexity required. I find that as a writer of fiction; it strikes me as unlikely that high-powered software developers have simpler problems than mine.

At the beginning of his talk Mann said one should look at how one spends one's time and think about how well it matches the priorities one claims to hold. One might claim, he said, that family and religion mattered most; if those were one's priorities, were they reflected in the distribution of e-mails? The time spent on e-mails rather than other things? Again, this was someone who had not bothered to spend three seconds thinking about what Google claims to be about.

We may certainly feel that the way Google handled its dealings with China is at odds with the moral position it claims to hold. It's still not unreasonable to think that many people working there think they can come up with ideas whose implementation will make the world a better place. They don't necessarily put religion high (maybe anywhere) on their list of priorities, but they might put making the world a better place high on the list; quite a lot of them might subscribe to the hacker principle, for instance, that the world would be a better place if good solutions to problems only had to be discovered once for everyone to have access to them. It's not clear that the distribution of e-mails in Inboxes (assuming a good spam filter) would be wildly out of sync with this; the challenge would be to make this form of communication more effective in promoting goals it to some extent already serves.

I don't know whether there is a one-size-fits-all system for managing e-mails. I don't know whether the recommended system would be more helpful in a workplace where life was what happened outside office hours. It was interesting to see what a bad fit was achieved under the assumption that this was the only possible type of workplace.

Wednesday, September 19, 2007

search engines and such

On Language Log, Barbara Partee comments on a counter-intuitive set of Google search results:

Google peculiarities: When I tried to get a rough Google comparison of "biggest * of any of the other" vs. "biggest * of any of the", I actually seemed to get a much bigger number for the first, though it should be a subset of the second. I got 106,000,000 for the first and just 12,800 for the second! But then with some help from Kai von Fintel and David Beaver, it was discovered that Google behaves very strangely with some ungrammatical strings. Closer inspection of the return from the search that seemed to give 106,000,000 hits shows that it returns only 3 pages of results, with the number 106,000,000 at the top of pages 1 and 2, but the number 21 on page 3, and in fact it only returned 21 hits.

David sleuthed out the phenomenon; here's his report.

***********

Unfortunately, the numbers given as results of google searches have become less meaningful over the last few years rather than improving in any sense relevant to us. The numbers google gives in response to a query are not counts of the number of pages with the given string. Rather, they are estimates based on a formula that, so far as I know, is not public. For simple searches, the estimate is presumably based on a calculation of the probability of the page having all the search terms based on the number of pages in the google caches for each of the component terms. But once you start doing string searches, this sort of approach becomes very unreliable.

I assume that the oddity of the result for "biggest * of any of the other" occurs because Google doesn't have any smart way to calculate the likelihood of strings for which the number of responses appears too large to simply count them. That is, I guess the algorithm works by first putting some bounds on the likely number of hits based on e.g. how rapidly various google network nodes appear to be sending responses, and if that number is sufficiently small, then google uses some fairly accurate algorithm for estimating the total, like counting every single response. But if there appear to be loads of responses, then the algorithm makes an estimate based on, well, who knows what. In the case at hand (and similarly for "smallest * of any of the other", "largest * of any of the other"), the estimate assumes some distributional properties that just don't hold for semantically or syntactically anomalous strings. Then, as you start going through the hits, Google is forced to self-correct as soon as you force it to actually enumerate all the results.

Hmm. So, if I'm right, then Barbara has stumbled on a rather interesting test for grammatical anomaly (though only relative to Google's bizarre assumptions about normality). Let's try another case: "* who thinks that is happy". This one has a pretty damn ordinary set of words in it, but suffers from an unfortunate case of a missing subject. Here Google initially estimates 10,900 results. But then it rapidly revises down to 16...

[article with the question that started this off here]
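David's guess about simple searches can be made concrete with a toy example. This is only a sketch of the general idea of estimating from per-term page counts under an independence assumption; as he says, the real formula is not public, and the four-page "index" and the function names here are invented for illustration.

```python
# Toy illustration only: NOT Google's actual method, which is not public.
# The "index" is four made-up pages; all names here are hypothetical.

pages = [
    "the biggest city of any of the states",
    "any of the other options",
    "the biggest of the other dogs",
    "biggest of any",
]

phrase = "biggest of any of the other"


def term_page_counts(pages, terms):
    """Count, for each distinct term, how many pages contain it as a word."""
    return {t: sum(1 for p in pages if t in p.split()) for t in set(terms)}


def independence_estimate(counts, index_size):
    """Estimate pages containing every term, assuming terms occur independently."""
    estimate = float(index_size)
    for count in counts.values():
        estimate *= count / index_size
    return estimate


def exact_phrase_count(pages, phrase):
    """Actually enumerate the pages that contain the exact string."""
    return sum(1 for p in pages if phrase in p)


counts = term_page_counts(pages, phrase.split())
estimate = independence_estimate(counts, len(pages))
exact = exact_phrase_count(pages, phrase)

print(f"estimated pages: {estimate:.2f}, pages actually found: {exact}")
```

Every word of the phrase is common in the toy index, so the estimate comes out well above zero, yet no page contains the string itself. The gap only shows up when the pages are actually enumerated, which is roughly the behaviour described above for anomalous strings.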

Meanwhile Joel Spolsky has a prophetic post claiming that Gmail will be the WordPerfect of e-mail here.

Thursday, August 2, 2007

the little prince

The diffusion theory here comes from the German sociologist Simmel. This says that adoption runs like a pig through a python. The earliest adopters take hold of the pig and then three things happen.

1) The later adopters go, "Pig! Yes, please. Now that I know about it, and now that it has been approved by my betters, I would very much like some pig."

2) The early adopters go, "Oh, please. Now that our lessers are consuming pig, we're not interested" and they bail out.

3) Eventually, the later adopters notice that the early adopters have bailed, and they bail, too.

Thus does a bump run through the python. As each later group adopts, each previous group repudiates. (Of course there are always extenuating circumstances. Adoption is also decided by the value created by competing parties. Simmel's theory accounts only for the effects of admiration and imitation.)

Now, the marketing community is keenly interested in buzz, word of mouth and the tipping point. But many marketers seem to believe you get to keep the early adopters. They act as if the python keeps filling up from one end to the other. In their view, apparently, the adoption process is not a running bump. It's a filling up.

THIS SITE SITS at the Intersection of Anthropology and Economics on Google