Friday, January 29, 2010

Gmail launches personalized ads

Google's popular mail service, Gmail, has launched advertising targeted not just to the particular e-mail message you are reading, but to other e-mails you might have read recently. An excerpt:
Sometimes, there aren't any good ads to match to a particular message. From now on, you'll sometimes see ads matched to another recent email instead.

For example, let's say you're looking at a message from a friend wishing you a happy birthday. If there aren't any good ads for birthdays, you might see the Chicago flight ads related to your last email instead.
It is a significant move toward personalized advertising and, as the Google post notes, a big change for Google, which previously "had specified that ads alongside an email were related only to the text of the current message." For example, here, Google says, "Ads and links to related pages only appear alongside the message that they are targeted to, and are only shown when the Google Mail user ... is viewing that particular message."
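
To make the new behavior concrete, here is a minimal sketch of the kind of fallback logic described in the excerpt above. The function names and the quality threshold are hypothetical; this is only an illustration, not Google's actual ad selection.

    def pick_ads(current_message, recent_messages, match_ads, quality_threshold=0.5):
        """Prefer ads matched to the current message; if none are good enough,
        fall back to ads matched to other recent messages (hypothetical sketch)."""
        ads = [ad for ad in match_ads(current_message) if ad.score >= quality_threshold]
        if ads:
            return ads
        # No good match for this message (say, a birthday note), so look at
        # recent messages instead, e.g. yesterday's email about Chicago flights.
        for message in recent_messages:
            ads = [ad for ad in match_ads(message) if ad.score >= quality_threshold]
            if ads:
                return ads
        return []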

For more on personalized ads that target not only the current content but also previously viewed content that has strong purchase intent, please see my July 2007 post, "What to advertise when there is no commercial intent?"

Wednesday, January 27, 2010

Yahoo on personalizing content and ads

Yahoo CEO Carol Bartz had a few tidbits on personalized relevance for content and advertising in the recent Yahoo Q4 2009 earnings call. Some excerpts:
We generate value ... [through] the vast amount of data we gather and use to deliver a better, more personal experience for users and a better, more targeted audience for our advertisers.

Since we began pairing our content optimization technology with editorial expertise we have seen click through rates in the Today module more than double ... We are making additional improvements to the technology that will make the user experience even more personally relevant.

Truth be told, no one has uncovered the holy grail of making advertising as relevant as content is 100% of the time. Beyond just offering advertisers a specific bucket, say women aged 35-45 who have children, we instead need to deliver many more specific attributes of scale. For example, women aged 35-45 with kids under three who are shopping for a minivan, and on and on and on and on. If we can do this we can create a better experience for both the user and the advertiser.

We have been letting great data about the consumers, data that is very attractive to advertisers fall to the floor ... We simply aren't even close to maximizing the value of our massive audience for advertisers.
Sounds like the goal is right, but the pace is slow. For more on that, please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization".

Sunday, January 24, 2010

Hybrid, not artificial, intelligence

Google VP Alfred Spector gave a talk last week at University of Washington Computer Science on "Research at Google". Archived video is available.

What was unusual about Al's talk was his focus on cooperation between computers and humans to allow both to solve harder problems than they might be able to otherwise.

Starting at 8:30 in the talk, Al describes this as a "virtuous cycle" of improvement using people's interactions with an application, allowing optimizations and features like learning to rank, personalization, and recommendations that might not be possible otherwise.

Later, around 33:20, he elaborates, saying we need "hybrid, not artificial, intelligence." Al explains, "It sure seems a lot easier ... when computers aren't trying to replace people but to help us in what we do. Seems like an easier problem .... [to] extend the capabilities of people."

Al goes on to say the most progress on very challenging problems (e.g. image recognition, voice-to-text, personalized education) will come from combining several independent, massive data sets with a feedback loop from people interacting with the system. It is an "increasingly fluid partnership between people and computation" that will help both solve problems neither could solve on their own.

This being a Google Research talk, there was much else covered, including the usual list of research papers out of Google, solicitation of students and faculty, pumping of Google as the best place to access big data and do research on big data, and a list of research challenges. The most interesting of the research challenges were robust, high performance, transparent data migration in response to load in massive clusters, ultra-low power computing (e.g. powered only by ambient light), personalized education where computers learn and model the needs of their students, and getting outside researchers access to the big data they need to help build hybrid, not artificial, intelligence.

Wednesday, January 20, 2010

Predictions for 2010

It's that time of year again. Many are making their predictions for the tech industry for 2010.

It's been a while since I played this game -- last time was my dark prediction for a dot-com crash in 2008 ([1] [2]) -- but I thought I'd try again this year.

I wrote up my predictions in a post over at blog@CACM, "What Will 2010 Bring?"

Because it is for the CACM, the predictions focus more on computing in general than on startups, recommendations, or search. And they are phrased as questions rather than as predictions.

I think the answer to some of the questions I posed may be no. For example, I doubt tablets will succeed this time around, don't think enterprises will move to the public cloud as much as expected, and am not sure that personalized advertising will always be used to benefit consumers. I do think netbooks are a dead market, mobile devices will become standardized and more like computers, and that 2010 will see big advances in local search and augmented reality on mobile devices.

If you have any thoughts on these predictions or some of your own to add, please comment either here or at blog@CACM.

Update: Another prediction, not in that list, that might be worth including here, "Who Needs Massively Multi-Core?"

Monday, January 04, 2010

Lectures on Computational Advertising

Slides from all the lectures of Andrei Broder's recent Computational Advertising class at Stanford University now are available online. Andrei is a VP and Chief Scientist at Yahoo and leads their Advertising Technology Group.

Lecture 6 (PDF) is particularly interesting with its coverage of learning to rank. Lecture 8 (PDF) has a tidbit on behavioral advertising and using recommender systems for advertising, but it is very brief. The first few lectures are introductory; don't miss lecture 3 (PDF) if you are new to sponsored search and want a good dive into the issues and techniques.

Thursday, December 31, 2009

YouTube needs to entertain

Miguel Helft at the New York Times has a good article this morning, "YouTube's Quest to Suggest More", on how YouTube is trying "to give its users what they want, even when the users aren't quite sure what that is."

The article focuses on YouTube's "plans to rely more heavily on personalization and ties between users to refine recommendations" and "suggesting videos that users may want to watch based on what they have watched before, or on what others with similar tastes have enjoyed."

What is striking about this is how little this has to do with search. As described in the article, what YouTube needs to do is entertain people who are bored but do not entirely know what they want. YouTube wants to get from users spending "15 minutes a day on the site" closer to the "five hours in front of the television." This is entertainment, not search. Passive discovery, playlists of content, deep classification hierarchies, well maintained catalogs, and recommendations of what to watch next will play a part; keyword search likely will play a lesser role.
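
For a rough sense of what "what others with similar tastes have enjoyed" means in practice, here is a minimal item-to-item collaborative filtering sketch over watch histories. The data and names are made up for illustration; this is not YouTube's actual algorithm.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical watch histories: user -> set of video ids.
    watches = {
        "u1": {"v1", "v2", "v3"},
        "u2": {"v1", "v3"},
        "u3": {"v2", "v3", "v4"},
    }

    # Count co-occurrences: videos watched by the same user are "related".
    related = defaultdict(lambda: defaultdict(int))
    for videos in watches.values():
        for a, b in combinations(sorted(videos), 2):
            related[a][b] += 1
            related[b][a] += 1

    def recommend(history, k=3):
        """Score candidates by how often they co-occur with videos the user
        has already watched, then return the top k unwatched videos."""
        scores = defaultdict(int)
        for video in history:
            for other, count in related[video].items():
                if other not in history:
                    scores[other] += count
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(recommend({"v1", "v3"}))   # ['v2', 'v4']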

And it gets back to the question of how different a problem Google is taking on with YouTube. Google is about search, keyword advertising, and finding content other people own. YouTube is about entertainment, discovery, content advertising, and cataloging and managing content they control. While Google certainly has the talent to succeed in new areas, it seems they are only now realizing how different YouTube is.

If you are interested in more on this, please see my Oct 2006 post, "YouTube is not Googly". Also, for a little on the technical challenges behind YouTube recommendations and managing a video catalog, please see my earlier posts "Video recommendations on YouTube" and "YouTube cries out for item authority".

Monday, December 28, 2009

Most popular posts of 2009

In case you might have missed them, here is a selection of some of the most popular posts on this blog in the last year.
  1. Jeff Dean keynote at WSDM 2009
    Describes Google's architecture and computational power
  2. Put that database in memory
    Claims in-memory databases should be used more often
  3. How Google crawls the deep web
    How Google probes and crawls otherwise hidden databases on the Web
  4. Advice from Google on large distributed systems
    Extends the first post above with more of an emphasis on how Google builds software
  5. Details on Yahoo's distributed database
    A look at another large scale distributed database
  6. Book review: Introduction to Information Retrieval
    A detailed review of Manning et al.'s fantastic new book. Please see also a recent review of Search User Interfaces.
  7. Google server and data center details
    Even more on Google's architecture, this one focused on data center cost optimization
  8. Starting Findory: The end
    A summary of and links to my posts describing what I learned at my startup, Findory, over its five years.
Overall, according to Google Analytics, the blog had 377,921 page views and 233,464 unique visitors in 2009. It has about 10k regular readers subscribed to its feed. I hope everyone is finding it useful!

Wednesday, December 16, 2009

Toward an external brain

I have a post up on blog@CACM, "The Rise of the External Brain", on how search over the Web is achieving what classical AI could not, an external brain that supplements our intelligence, knowledge, and memories.

Tuesday, December 08, 2009

Personalized search for all at Google

As has been widely reported, Google is now personalizing web search results for everyone who uses Google, whether logged in or not.

Danny Sullivan at Search Engine Land has particularly good coverage. An excerpt:
Beginning today, Google will now personalize the search results of anyone who uses its search engine, regardless of whether they've opted-in to a previously existing personalization feature.

The short story is this. By watching what you click on in search results, Google can learn that you favor particular sites. For example, if you often search and click on links from Amazon that appear in Google's results, over time, Google learns that you really like Amazon. In reaction, it gives Amazon a ranking boost. That means you start seeing more Amazon listings, perhaps for searches where Amazon wasn't showing up before.

Searchers will have the ability to opt-out completely, and there are various protections designed to safeguard privacy. However, being opt-out rather than opt-in will likely raise some concerns.
There now appears to be a big push at Google for individualized targeting and personalization in search, advertising, and news. Google seems to be going full throttle on personalization, choosing it as the way forward to improve relevance and usefulness.

With only one generic relevance rank, Google has been finding it increasingly difficult to improve search quality because not everyone agrees on how relevant a particular page is to a particular search. At some point, to get further improvements, Google has to customize relevance to each person's definition of relevance. When you do that, you have personalized search.
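
As a small illustration of the kind of per-user adjustment Danny Sullivan describes, here is a minimal reranking sketch that gives a site a boost proportional to how often the user has clicked on it before. The weights and function are hypothetical, not Google's actual ranking.

    def personalize(results, click_counts, boost=0.1):
        """Rerank (site, base_score) pairs, boosting sites this user clicks often."""
        total_clicks = sum(click_counts.values()) or 1

        def score(result):
            site, base_score = result
            affinity = click_counts.get(site, 0) / total_clicks
            return base_score * (1.0 + boost * affinity)

        return sorted(results, key=score, reverse=True)

    results = [("example.com", 0.84), ("amazon.com", 0.82)]
    clicks = {"amazon.com": 40, "othersite.com": 10}
    print(personalize(results, clicks))   # amazon.com now ranks first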

For more on recent moves to personalize news and advertising at Google, please see my posts, "Google CEO on personalized news" and "Google AdWords now personalized".

Update: Two hours later, Danny Sullivan writes a second post, "Google's Personalized Results: The 'New Normal' That Deserves Extraordinary Attention", that also is well worth reading.

Thursday, December 03, 2009

Recrawling and keeping search results fresh

A paper by three Googlers, "Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages" (not available online), is one of several recent papers looking at "the cost of a page being stale versus the cost of [recrawling]."

The core idea here is that people care a lot about some changes to web pages and don't care about others, and search engines need to respond to that to make search results relevant.

Unfortunately, our Googlers punt on the really interesting problem here, determining the cost of a page being stale. They simply assume any page that is stale hurts relevance the same amount.

That clearly is not true. Not only do some pages appear more frequently than other pages in search results, but also some changes to pages matter more to people than others.

Getting at the cost of being stale is difficult, but a good start is "The Impact of Crawl Policy on Web Search Effectiveness" (PDF) recently presented at SIGIR 2009. It uses PageRank and in-degree as a rough estimate of what pages people will see and click on in search results, then explores the impact of recrawling the pages people want more frequently.

But that still does not capture whether the change is something people care about. Is, for example, the change below the fold on the page, so less likely to be seen? Is the change correcting a typo or changing an advertisement? In general, what is the cost of showing stale information for this page?

"Resonance on the Web: Web Dynamics and Revisitation Patterns" (PDF), recently presented at CHI, starts to explore that question, looking at the relationship between web content change and how much people want to revisit the pages, as well as thinking about the question of what is an interesting content change.

As it turns out, news is something where change matters and people revisit frequently, and there have been several attempts to treat real-time content such as news differently in search results. One recent example is "Click-Through Prediction for News Queries" (PDF), presented at SIGIR 2009, that describes one method of trying to know when people will want to see news articles for a web search query.

But, rather than coming up with rules for when content from various federated sources should be shown, I wonder if we cannot find a simpler solution. All of these works strive toward the same goal, understanding when people care about change. Relevance depends on what we want, what we see, and what we notice. Search results need only to appear fresh.

Recrawling high PageRank pages is a rough attempt at making results appear fresh, since high PageRank means a page is more likely to be shown and noticed at the top of search results, but it clearly is only an approximation. What we really want to know is: Who will see a change? If people see it, will they notice? If they notice, will they care?
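
One way to turn those three questions into a recrawl policy is to order pages by their expected cost of staleness. A minimal sketch, where the probabilities stand in for estimates you would learn from query logs, toolbar data, or revisitation patterns:

    def staleness_cost(p_seen, p_noticed, importance, p_changed):
        """Expected cost of serving this page stale: the chance someone sees it
        in results, notices the change, and cares, times how likely it changed."""
        return p_seen * p_noticed * importance * p_changed

    def recrawl_order(pages):
        """pages: list of (url, p_seen, p_noticed, importance, p_changed).
        Recrawl the pages with the highest expected staleness cost first."""
        return sorted(pages, key=lambda page: staleness_cost(*page[1:]), reverse=True)

    pages = [
        ("news-front-page", 0.9, 0.8, 1.0, 0.9),   # seen often, changes matter
        ("old-forum-thread", 0.1, 0.5, 0.2, 0.1),  # rarely seen, rarely changes
    ]
    for url, *_ in recrawl_order(pages):
        print(url)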

Interestingly, people's actions tell us a lot about what they care about. Our wants and needs, where our attention lies, all live in our movements across the Web. If we listen carefully, these voices may speak.

For more on that, please see also my older posts, "Google toolbar data and the actual surfer model" and "Cheap eyetracking using mouse tracking".

Update: One month later, an experiment shows that new content on the Web can be generally available on Google search within 13 seconds.

Thursday, November 19, 2009

Continuous deployment at Facebook

E. Michael Maximilien has a post, "Extreme Agility at Facebook", on blog@CACM. The post reports on a talk at OOPSLA by Robert Johnson (Director of Engineering at Facebook) titled "Moving Fast at Scale".

Here is an interesting excerpt on very frequent deployment of software and how it reduces downtime:
Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impact on the rest of the system and quickly fix any bugs that result from these frequent small changes.

Second, there are limited QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communication. The team uses various staging and deployment tools as well as strategies such as A/B testing and gradual targeted geographic launches.

This has resulted in a site that has experienced, according to Robert, less than 3 hours of down time in the past three years.
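
As a rough illustration of the gradual, targeted geographic launches mentioned above, here is a minimal percentage-rollout sketch keyed on a stable hash of the user id. The function and parameters are a generic pattern, not Facebook's actual tooling.

    import hashlib

    def in_rollout(user_id, feature, percent, regions=None, user_region=None):
        """Stable, gradual rollout: the same user always gets the same answer
        for a given feature, and the launch can be limited to a few regions."""
        if regions is not None and user_region not in regions:
            return False
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent

    # Launch the new feature to 5% of users in two test regions first.
    print(in_rollout("user42", "new_feed", 5, regions={"NZ", "IE"}, user_region="NZ"))
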
For more on the benefits of deploying software very frequently, not just for Facebook but for many software companies, please see also my post on blog@CACM, "Frequent Releases Change Software Engineering".

Monday, November 16, 2009

Put that database in memory

An upcoming paper, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM" (PDF), makes some interesting new arguments for shifting most databases to serving entirely out of memory rather than off disk.

The paper looks at Facebook as an example and points out that, due to aggressive use of memcached and caches in mysql, the memory they use already is about "75% of the total size of the data (excluding images)." They go on to argue that a system designed around in-memory storage with disk just used for archival purposes would be much simpler, more efficient, and faster. They also look at examples of smaller databases and note that, with servers getting to 64G of RAM and higher and most databases just a couple terabytes, it doesn't take that many servers to get everything in memory.
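
As a rough back-of-the-envelope check on that last claim (my numbers, not the paper's):

    # How many 64 GB servers does it take to hold a 2 TB database in memory?
    data_tb = 2.0
    ram_per_server_gb = 64
    headroom = 0.5      # assume only half of each server's RAM holds data

    servers = (data_tb * 1024) / (ram_per_server_gb * headroom)
    print(round(servers))   # ~64 servers, a modest cluster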

An excerpt from the paper:
Developers are finding it increasingly difficult to scale disk-based systems to meet the needs of large-scale Web applications. Many people have proposed new approaches to disk-based storage as a solution to this problem; others have suggested replacing disks with flash memory devices.

In contrast, we believe that the solution is to shift the primary locus of online data from disk to random access memory, with disk relegated to a backup/archival role ... [With] all data ... in DRAM ... [we] can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput .... [while] eliminating many of the scalability issues that sap developer productivity today.
One subtle but important point the paper makes is that the slow speed of current databases has made web applications both more complicated and more limited than they should be. From the paper:
Traditional applications expect and get latency significantly less than 5-10 μs ... Because of high data latency, Web applications typically cannot afford to make complex unpredictable explorations of their data, and this constrains the functionality they can provide. If Web applications are to replace traditional applications, as has been widely predicted, then they will need access to data with latency much closer to what traditional applications enjoy.

Random access with very low latency to very large datasets ... will not only simplify the development of existing applications, but they will also enable new applications that access large amounts of data more intensively than has ever been possible. One example is ... algorithms that must traverse large irregular graph structures, where the access patterns are ... unpredictable.
The authors point out that data access patterns currently need to be heavily optimized, carefully ordered, and must conservatively acquire extra data in case it is later needed, all things that mostly go away if you are using a database where access has microsecond latency.

While the authors do not go as far as to argue that memory-based databases are cheaper, they do argue that they are cost competitive, especially once developer time is taken into account. It seems to me that you could go a step further here and argue that very low latency databases bring such large productivity gains to developers and benefits to application users that they are in fact cheaper, but the paper does not try to do that.

If you don't have time to read the paper, slides (PDF) from a talk by one of the authors are also available and are quick to skim.

If you can't get enough of this topic, please see my older post, "Replication, caching, and partitioning", which argues that big caching layers, such as memcached, are overdone compared to having each database shard serve most data out of memory.

HT, James Hamilton, for first pointing to the RAMClouds slides.

Thursday, November 12, 2009

The reality of doing a startup

Paul Graham has a fantastic article up, "What Startups Are Really Like", with the results of what happened when he asked all the founders of the Y Combinator startups "what surprised them about starting a startup."

A brief excerpt summarizing the findings:
Unconsciously, everyone expects a startup to be like a job, and that explains most of the surprises. It explains why people are surprised how carefully you have to choose cofounders and how hard you have to work to maintain your relationship. You don't have to do that with coworkers. It explains why the ups and downs are surprisingly extreme. In a job there is much more damping. But it also explains why the good times are surprisingly good: most people can't imagine such freedom. As you go down the list, almost all the surprises are surprising in how much a startup differs from a job.
There are 19 surprises listed in the essay. Below are excerpts from some of them:
Be careful who you pick as a cofounder ... [and] work hard to maintain your relationship.

Startups take over your life ... [You will spend] every waking moment either working or thinking about [your] startup.

It's an emotional roller-coaster ... How low the lows can be ... [though] it can be fun ... [But] starting a startup is fun the way a survivalist training course would be fun, if you're into that sort of thing. Which is to say, not at all, if you're not.

Persistence is the key .... [but] mere determination, without flexibility ... may get you nothing.

You have to do lots of different things ... It's much more of a grind than glamorous.

When you let customers tell you what they're after, they will often reveal amazing details about what they find valuable as well as what they're willing to pay for.

You can never tell what will work. You just have to do whatever seems best at each point.

Expect the worst with deals ... Deals fall through.

The degree to which feigning certitude impressed investors .... A lot of what startup founders do is just posturing. It works.

How much of a role luck plays and how much is outside of [your] control ... Having skill is valuable. So is being determined as all hell. But being lucky is the critical ingredient ... Founders who succeed quickly don't usually realize how lucky they were.
Definitely worth reading the entire article if you are at all considering a startup.

For my personal take on some surprises I hit, please see my earlier post on Starting Findory.

Tuesday, November 10, 2009

Scary data on Botnet activity

An amusingly titled paper to be presented at the CCS 2009 conference, "Your Botnet is My Botnet: Analysis of a Botnet Takeover" (PDF), contains some not-so-funny data on how sophisticated the hijacking of computers has become, the data these botnets are able to collect, and the profits that fuel the development of more and more dangerous botnets.

Extended excerpts from the paper, focusing on the particularly scary bits:
We describe our experience in actively seizing control of the Torpig (a.k.a. Sinowal, or Anserin) botnet for ten days. Torpig ... has been described ... as "one of the most advanced pieces of crimeware ever created." ... The sophisticated techniques it uses to steal data from its victims, the complex network infrastructure it relies on, and the vast financial damage that it causes set Torpig apart from other threats.

Torpig has been distributed to its victims as part of Mebroot. Mebroot is a rootkit that takes control of a machine by replacing the system's Master Boot Record (MBR). This allows Mebroot to be executed at boot time, before the operating system is loaded, and to remain undetected by most anti-virus tools.

Victims are infected through drive-by-download attacks ... Web pages on legitimate but vulnerable web sites ... request JavaScript code ... [that] launches a number of exploits against the browser or some of its components, such as ActiveX controls and plugins. If any exploit is successful ... an installer ... injects a DLL into the file manager process (explorer.exe) ... [that] makes all subsequent actions appear as if they were performed by a legitimate system process ... loads a kernel driver that wraps the original disk driver (disk.sys) ... [and] then overwrite[s] the MBR of the machine with Mebroot.

Mebroot has no malicious capability per se. Instead, it provides a generic platform that other modules can leverage to perform their malicious actions ... Immediately after the initial reboot ... [and] in two-hour intervals ... Mebroot contacts the Mebroot C&C server to obtain malicious modules ... All communication ... is encrypted.

The Torpig malware ... injects ... DLLs into ... the Service Control Manager (services.exe), the file manager, and 29 other popular applications, such as web browsers (e.g., Microsoft Internet Explorer, Firefox, Opera), FTP clients (CuteFTP, LeechFTP), email clients (e.g., Thunderbird, Outlook, Eudora), instant messengers (e.g., Skype, ICQ), and system programs (e.g., the command line interpreter cmd.exe). After the injection, Torpig can inspect all the data handled by these programs and identify and store interesting pieces of information, such as credentials for online accounts and stored passwords. ... Every twenty minutes ... Torpig ... upload[s] the data stolen.

Torpig uses phishing attacks to actively elicit additional, sensitive information from its victims, which, otherwise, may not be observed during the passive monitoring it normally performs ... Whenever the infected machine visits one of the domains specified in the configuration file (typically, a banking web site), Torpig ... injects ... an HTML form that asks the user for sensitive information, for example, credit card numbers and social security numbers. These phishing attacks are very difficult to detect, even for attentive users. In fact, the injected content carefully reproduces the style and look-and-feel of the target web site. Furthermore, the injection mechanism defies all phishing indicators included in modern browsers. For example, the SSL configuration appears correct, and so does the URL displayed in the address bar.

Consistent with the past few years' shift of malware from a for-fun (or notoriety) activity to a for-profit enterprise, Torpig is specifically crafted to obtain information that can be readily monetized in the underground market. Financial information, such as bank accounts and credit card numbers, is particularly sought after. In ten days, Torpig obtained the credentials of 8,310 accounts at 410 different institutions ... 1,660 unique credit and debit card numbers .... 297,962 unique credentials (username and password pairs) .... [in] information that was sent by more than 180 thousand infected machines.
The paper estimates the value of the data collected by this sophisticated piece of malware at between $3M and $300M per year on the black market.

[Paper found via Bruce Schneier]

Saturday, November 07, 2009

Starting Findory: The end

This is the end of my Starting Findory series.

Findory was my first startup and a nearly five year effort. Its goal of personalizing information was almost laughably ambitious, a joy to pursue, and I learned much.

I learned that cheap is good, but too cheap is bad. It does little good to avoid burning too fast only to starve yourself of what you need.

I re-learned the importance of a team, one that balances the weaknesses of some with the strengths of another. As fun as learning new things might be, trying to do too much yourself costs the startup too much time in silly errors born of inexperience.

I learned the necessity of good advisors, especially angels and lawyers. A startup needs people who can provide expertise, credibility, and connections. You need advocates to help you.

And, I learned much more, some of which is detailed in the other posts in the Starting Findory series:
  1. The series
  2. In the beginning
  3. On the cheap
  4. Legal goo
  5. Launch early and often
  6. Startups are hard
  7. Talking to the press
  8. Customer feedback
  9. Marketing
  10. The team
  11. Infrastructure and scaling
  12. Hardware go boom
  13. Funding
  14. Acquisition talks
  15. The end
I hope you enjoyed these posts about my experience trying to build a startup. If you did like this Starting Findory series, you might also be interested in my Early Amazon posts. They were quite popular a few years ago.

Wednesday, November 04, 2009

Using only experts for recommendations

A recent paper from SIGIR, "The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web" (PDF), has a very useful exploration into the effectiveness of recommendations using only a small pool of trusted experts.

The results suggest that using a small pool of a couple hundred experts, possibly your own experts or experts selected and mined from the web, has quite a bit of value, especially in cases where big data from a large community is unavailable.

A brief excerpt from the paper:
Recommending items to users based on expert opinions .... addresses some of the shortcomings of traditional CF: data sparsity, scalability, noise in user feedback, privacy, and the cold-start problem .... [Our] method's performance is comparable to traditional CF algorithms, even when using an extremely small expert set .... [of] 169 experts.

Our approach requires obtaining a set of ... experts ... [We] crawled the Rotten Tomatoes web site -- which aggregates the opinions of movie critics from various media sources -- to obtain expert ratings of the movies in the Netflix data set.
The authors certainly do not claim that using a small pool of experts is better than traditional collaborative filtering.

What they do say is that using a very small pool of experts works surprisingly well. In particular, I think it suggests a good alternative to content-based methods for bootstrapping a recommender system. If you can create a high quality pool of experts, even a fairly small one, you may have good results starting with that while you work to gather ratings from the broader community.
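
Here is a minimal sketch of the general idea, predicting a user's rating as a similarity-weighted average of expert ratings. It follows the spirit of the paper but simplifies the details; the data and thresholds are made up.

    import math

    def cosine(a, b):
        """Cosine similarity over the items both the user and the expert rated."""
        common = set(a) & set(b)
        if not common:
            return 0.0
        dot = sum(a[i] * b[i] for i in common)
        norm = (math.sqrt(sum(a[i] ** 2 for i in common)) *
                math.sqrt(sum(b[i] ** 2 for i in common)))
        return dot / norm if norm else 0.0

    def predict(user_ratings, experts, item, min_sim=0.1):
        """Predict the user's rating for item as the similarity-weighted
        average of ratings from experts who rated that item."""
        weighted, total = 0.0, 0.0
        for expert in experts:
            if item not in expert:
                continue
            sim = cosine(user_ratings, expert)
            if sim >= min_sim:
                weighted += sim * expert[item]
                total += sim
        return weighted / total if total else None

    experts = [{"m1": 4, "m2": 2, "m3": 5}, {"m1": 3, "m3": 4}]
    print(predict({"m1": 5, "m2": 1}, experts, "m3"))   # about 4.5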

Thursday, October 29, 2009

Google CEO on personalized news

Google CEO Eric Schmidt has been talking quite a bit about personalization in online news recently. First, Eric said:
We and the industry ... [should] personalize the news.

At its best, the on-line version of a newspaper should learn from the information I'm giving it -- what I've read, who I am and what I like -- to automatically send me stories and photos that will interest me.
Then, Eric described how newspapers could make money using personalized advertising:
Imagine a magazine online that knew everything about you, knew what you had read, allowed you to go deep into a subject and also showed you things... that are serendipit[ous] ... popular ... highly targetable ... [and] highly advertisable. Ultimately, money will be made.
Finally, Eric claimed Google has a moral duty to help newspapers succeed:
Google sees itself as trying to make the world a better place. And our values are that more information is positive -- transparency. And the historic role of the press was to provide transparency, from Watergate on and so forth. So we really do have a moral responsibility to help solve this problem.

Well-funded, targeted professionally managed investigative journalism is a necessary precondition in my view to a functioning democracy ... That's what we worry about ... There [must be] enough revenue that ... the newspaper [can] fulfill its mission.
Eric's words come at a time when, as the New York Times reports, newspapers are cratering, with "revenue down 16.6 percent last year and about 28 percent so far this year."

For more on personalized news, please see my earlier posts, "People who read this article also read", "A brief history of Findory", and "Personalizing the newspaper".

For more on personalized advertising, please see my July 2007 post, "What to advertise when there is no commercial intent?"

Update: Some more useful references in the comments.

Update: Five weeks later, Eric Schmidt, in the WSJ, imagines a newspaper that "knows who I am, what I like, and what I have already read" and that makes sure that "like the news I am reading, the ads are tailored just for me" instead of being "static pitches for products I'd never use." He also criticizes newspapers for treating readers "as a stranger ... every time [they] return."

Wednesday, October 21, 2009

Advice from Google on large distributed systems

Google Fellow Jeff Dean gave a keynote talk at LADIS 2009 on "Designs, Lessons and Advice from Building Large Distributed Systems". Slides (PDF) are available.

Some of this talk is similar to Jeff's past talks but with updated numbers. Let me highlight a few things that stood out:

A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and "the entire [written] works of humankind, from the beginning of recorded history, in all languages" has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world's written information on disk, and still have plenty of room for logs and other data sets. Certainly no shortage of storage at Google.
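
For anyone who wants to check that arithmetic or plug in their own guesses:

    servers = 500_000             # a guess, probably on the low end
    ram_gb, disk_tb = 16, 2       # per standard server, from the talk
    replication = 3

    ram_pb = servers * ram_gb / 1_000_000                # GB -> PB
    disk_pb = servers * disk_tb / 1_000 / replication    # TB -> PB, after x3 copies
    print(f"~{ram_pb:.0f} PB in RAM, ~{disk_pb:.0f} PB on disk")   # ~8 and ~333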

Jeff says, "Things will crash. Deal with it!" He then notes that Google's datacenter experience is that, in just one year, 1-5% of disks fail, 2-4% of servers fail, and each machine can be expected to crash at least twice. Worse, as Jeff notes briefly in this talk and expanded on in other talks, some of the servers can have slowdowns and other soft failure modes, so you need to track not just up/down states but whether the performance of the server is up to the norm. As he has said before, Jeff suggests adding plenty of monitoring, debugging, and status hooks into your systems so that, "if your system is slow or misbehaving" you can quickly figure out why and recover. From the application side, Jeff suggests apps should always "do something reasonable even if it is not all right" on a failure because it is "better to give users limited functionality than an error page."

Jeff emphasizes the importance of back-of-the-envelope calculations on performance, "the ability to estimate the performance of a system design without actually having to build it." To help with this, on slide 24, Jeff provides "numbers everyone should know" with estimates of times to access data locally from cache, memory, or disk and remotely across the network. On the next slide, he walks through an example of estimating the time to render a page with 30 thumbnail images under several design options. Jeff stresses the importance of having at least a high-level understanding of the performance of every major system you touch, saying, "If you don't know what's going on, you can't do decent back-of-the-envelope calculations!" and later adding, "Think about how much data you're shuffling around."
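
Here is that thumbnail example worked through with roughly the "numbers everyone should know" from the slides (a disk seek around 10 ms, sequential disk reads around 30 MB/s). The arithmetic is mine, so treat the exact figures as approximate.

    seek_ms = 10.0               # one disk seek
    disk_mb_per_s = 30.0         # sequential read throughput
    thumbs, thumb_kb = 30, 256

    read_ms = (thumb_kb / 1024) / disk_mb_per_s * 1000   # read one thumbnail

    serial = thumbs * (seek_ms + read_ms)   # read one at a time from one disk
    parallel = seek_ms + read_ms            # issue all reads in parallel across disks
    print(f"serial ~{serial:.0f} ms, parallel ~{parallel:.0f} ms")   # ~550 vs ~18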

Jeff makes an insightful point that, when designing for scale, you should design for expected load, ensure it still works at x10, but don't worry about scaling to x100. The problem here is that x100 scale usually calls for a different and more complicated solution than what you would implement for x1; a x100 solution can be unnecessary, wasteful, slower to implement, and have worse performance at a x1 load. I would add that you learn a lot about where the bottlenecks will be at x100 scale when you are running at x10 scale, so it often is better to start simpler, learn, then redesign rather than jumping into a more complicated solution that might be a poor match for the actual load patterns.

The talk covers BigTable, which was discussed in previous talks but now has some statistics updated, and then goes on to talk about a new storage and computation system called Spanner. Spanner apparently automatically moves and replicates data based on usage patterns, optimizes the resources of the entire cluster, uses a hierarchical directory structure, allows fine-grained control of access restrictions and replication on the data, and supports distributed transactions for applications that need it (and can tolerate the performance hit). I have to say, the automatic replication of data based on usage sounds particularly cool; it has long bothered me that most of these data storage systems create three copies for all data rather than automatically creating more than three copies of frequently accessed head data (such as the last week's worth of query logs) and then disposing of the extra replicas when they are no longer in demand. Jeff says they want Spanner to scale to 10M machines and an exabyte (1k petabytes) of data, so it doesn't look like Google plans on cutting their data center growth or hardware spend any time soon.

Data center guru James Hamilton was at the LADIS 2009 talk and posted detailed notes. Both James' notes and Jeff's slides (PDF) are worth reviewing.

Monday, October 19, 2009

Using the content of music for search

I don't know much about analyzing music streams to find similar music, which is part of why I much enjoyed reading "Content-Based Music Information Retrieval" (PDF). It is a great survey of the techniques used, helpfully points to a few available tools, and gives several examples of interesting research projects and commercial applications.

Some extended excerpts:
At present, the most common method of accessing music is through textual metadata .... [such as] artist, album ... track title ... mood ... genre ... [and] style .... but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for.

For example ... Shazam ... can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar ... Nayio ... allows one to sing a query and attempts to identify the work .... [In] Musicream ... icons representing pieces flow one after another ... [and] by dragging a disc in the flow, the user can easily pick out other similar pieces .... MusicRainbow ... [determines] similarity between artists ... computed from the audio-based similarity between music pieces ... [and] the artists are then summarized with word labels extracted from web pages related to the artists .... SoundBite ... uses a structural segmentation [of music tracks] to generate representative thumbnails for [recommendations] and search.

An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of music .... Surprisingly, it is not only difficult to extract melody from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony .... [Instead] low-level audio features and their aggregate representations [often] are used as the first stage ... to obtain a high-level representation of music.

Low-level audio features [include] frame-based segmentations (periodic sampling at 10ms - 1000ms intervals), beat-synchronous segmentations (features aligned to musical beat boundaries), and statistical measures that construct probability distributions out of features (bag of features models).

Estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter ... [lets us] find musical pieces having similar tempo without using any metadata .... The basic approach ... is to detect onset times and use them as cues ... [and] maintain multiple hypotheses ... [in] ambiguous situations.

Melody forms the core of Western music and is a strong indicator for the identity of a musical piece ... Estimated melody ... [allows] retrieval based on similar singing voice timbres ... classification based on melodic similarities ... and query by humming .... Melody and bass lines are represented as a continuous temporal-trajectory representation of fundamental frequency (F0, perceived as pitch) or a series of musical notes .... [for] the most predominant harmonic structure ... within an intentionally limited frequency range.

Audio fingerprinting systems ... seek to identify specific recordings in new contexts ... to [for example] normalize large music content databases so that a plethora of versions of the same recording are not included in a user search and to relate user recommendation data to all versions of a source recording including radio edits, instrumental, remixes, and extended mix versions ... [Another example] is apocrypha ... [where] works are falsely attributed to an artist ... [possibly by an adversary after] some degree of signal transformation and distortion ... Audio shingling ... [of] features ... [for] sequences of 1 to 30 seconds duration ... [using] LSH [is often] employed in real-world systems.
The paper goes into much detail on these topics as well as covering other areas such as chord and key recognition, chorus detection, aligning melody and lyrics (for Karaoke), approximate string matching techniques for symbolic music data (such as matching noisy melody scores), and difficulties such as polyphonic music or scaling to massive music databases. There also is a nice pointer to publicly available tools for playing with these techniques if you are so inclined.
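
To make the "low-level audio features" above a little more concrete, here is a tiny frame-based feature sketch with numpy: per-frame RMS energy and spectral centroid, aggregated into a bag-of-features histogram. It only illustrates the idea, not any particular system from the survey.

    import numpy as np

    def frame_features(signal, sample_rate, frame_ms=50):
        """Split the signal into fixed-length frames and compute two simple
        low-level features per frame: RMS energy and spectral centroid."""
        frame_len = int(sample_rate * frame_ms / 1000)
        features = []
        for i in range(len(signal) // frame_len):
            frame = signal[i * frame_len:(i + 1) * frame_len]
            energy = np.sqrt(np.mean(frame ** 2))
            spectrum = np.abs(np.fft.rfft(frame))
            freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
            centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9)
            features.append((energy, centroid))
        return np.array(features)

    # A bag-of-features summary: histogram the centroids, ignoring their order.
    sr = 22050
    t = np.linspace(0, 1.0, sr, endpoint=False)
    tone = np.sin(2 * np.pi * 440 * t)               # a 440 Hz test tone
    feats = frame_features(tone, sr)
    bag, _ = np.histogram(feats[:, 1], bins=8, range=(0, sr / 2))
    print(bag)   # nearly all frames land in the lowest-frequency bin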

By the way, for a look at an alternative to these kinds of automated analyses of music content, don't miss this last Sunday's New York Times Magazine article, "The Song Decoders", describing Pandora's effort to manually add fine-grained mood, genre, and style categories to songs and artists and then use them for finding similar music.

Friday, October 16, 2009

An arms race in spamming social software

Security guru Bruce Schneier has a great post up, "The Commercial Speech Arms Race", on the difficulty of eliminating spam in social software. An excerpt:
When Google started penalising a site's search engine rankings for having ... link farms ... [then] people engaged in sabotage: they built link farms and left blog comment spam to their competitors' sites.

The same sort of thing is happening on Yahoo Answers. Initially, companies would leave answers pushing their products, but Yahoo started policing this. So people have written bots to report abuse on all their competitors. There are Facebook bots doing the same sort of thing.

Last month, Google introduced Sidewiki, a browser feature that lets you read and post comments on virtually any webpage ... I'm sure Google has sophisticated systems ready to detect commercial interests that try to take advantage of the system, but are they ready to deal with commercial interests that try to frame their competitors?

This is the arms race. Build a detection system, and the bad guys try to frame someone else. Build a detection system to detect framing, and the bad guys try to frame someone else framing someone else. Build a detection system to detect framing of framing, and well, there's no end, really.

Commercial speech is on the internet to stay; we can only hope that they don't pollute the social systems we use so badly that they're no longer useful.
An example that Bruce did not mention is shill reviews on Amazon and elsewhere, something that appears to have become quite a problem nowadays. The most egregious example of this is paying people using Amazon MTurk to write reviews, as CMU professor Luis von Ahn detailed a few months ago.

Some of the spam can be detected using algorithms, looking for atypical behaviors in text or actions, and using community feedback, but even community feedback can be manipulated. It is common, for example, to see negative reviews get a lot of "not helpful" votes on Amazon.com, which, at least in some cases, appears to be the work of people who might gain from suppressing those reviews. An arms race indeed.
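
As one small example of "looking for atypical behaviors", here is a toy sketch that flags a review when most of its "not helpful" votes come from accounts that vote on little else. It is only a heuristic for illustration, not any site's actual detector.

    def suspicious(voters, votes_per_account, min_votes=5, threshold=0.6):
        """voters: account ids that voted 'not helpful' on this review.
        votes_per_account: total votes each account has ever cast anywhere.
        Flag the review if most of its unhelpful votes come from accounts
        that have voted on almost nothing else (a common shill pattern)."""
        if len(voters) < min_votes:
            return False
        single_purpose = sum(1 for a in voters if votes_per_account.get(a, 0) <= 2)
        return single_purpose / len(voters) >= threshold

    history = {"a1": 1, "a2": 2, "a3": 150, "a4": 1, "a5": 1}
    print(suspicious(["a1", "a2", "a3", "a4", "a5"], history))   # True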

An alternative to detection is to go after the incentive to spam, trying to reduce the reward from spamming. The winner-takes-all effect of search engine optimization -- where being the top result for a query has enormous value because everyone sees it -- could be countered, for example, by showing different results to different people. For more on that, please see my old July 2006 post, "Combating web spam with personalization".