Monday, September 15, 2008

KDD talk on advertising auctions

Michael Schwarz from Yahoo Research gave an invited talk at KDD 2008 on "Internet Advertising and Optimal Auction Design".

Much of the talk was background on first and second price auctions, but a particularly good discussion came at the end when Michael was talking about optimal setting of reserve prices for auctions.

A reserve price is the minimum bid for an advertiser to win an auction. Michael discussed how setting a reserve price correctly can significantly improve advertising revenue.

He did not cover the assumptions behind this claim in his talk, but details are available from his 2007 paper, "Optimal Auction Design in a Multi-unit Environment" (PDF). Briefly, setting the reserve price carefully can substantially increase advertising revenue when there are only a few bidders and when most bidders value a click well above the normal reserve price (e.g. most advertisers get $1 of value from a click but the default reserve price is $.05).

Michael also described a non-obvious effect where higher reserve prices indirectly impact top bidders because of price pressure from the now higher bids below them. Again, this is primarily an issue of the reserve price substituting for lack of competition in the bidding, but still a useful thing to note.
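To make the mechanics concrete, here is a small sketch in Python of generalized second price pricing with a reserve (the bids and reserve values are made up, not from the talk). The reserve directly sets the price only for the lowest winning bidder; the impact on top bidders arrives indirectly, when the bidders below them respond by raising their bids.

# Minimal sketch of generalized second price (GSP) pricing with a
# reserve. Bids and reserve values are made-up numbers. Each winner
# pays the bid just below theirs (or the reserve, whichever is higher),
# so a higher reserve directly raises only the bottom price; the effect
# on top bidders comes when lower bidders respond by bidding higher.

def gsp_prices(bids, reserve):
    """Return (bid, price paid per click) for each slot winner."""
    qualified = sorted((b for b in bids if b >= reserve), reverse=True)
    prices = []
    for i, bid in enumerate(qualified):
        next_bid = qualified[i + 1] if i + 1 < len(qualified) else reserve
        prices.append((bid, max(next_bid, reserve)))
    return prices

print(gsp_prices([1.00, 0.60, 0.10], reserve=0.05))
# [(1.0, 0.6), (0.6, 0.1), (0.1, 0.05)]
print(gsp_prices([1.00, 0.60, 0.10], reserve=0.25))
# [(1.0, 0.6), (0.6, 0.25)] -- the $0.10 bidder is priced out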

After the talk, I chatted a bit with Michael about optimizing the reserve price for things other than immediate revenue. For example, we might want to optimize reserve price for user satisfaction, setting the reserve price at a level that effectively represents a high cost of user attention and filters out ads that users do not find worthwhile. Or, we might want to set our reserve price lower in the long tail of advertising keywords, making it cheaper (and bearing some of the risk) for advertisers as they try to gather information about how well their ads might work on more obscure keywords.

By the way, if you liked this post, you might also like one of Michael's other papers, "Internet Advertising and the Generalized Second Price Auction: Selling Billions of Dollars Worth of Keywords" (PDF). It is a fun paper that describes how current GSP advertising auctions do not actually force truth telling. The problem is rooted in the way there are multiple positions available and, if the lower positions still generate a good number of clicks, it can be advantageous for advertisers to offer lower bids.

Wednesday, September 03, 2008

KDD 2008 panel on the future of social networks

A panel at KDD 2008 called "Social Networks: Looking Ahead" was more remarkable for the questions it failed to answer than the ones it did.

One question that was repeatedly presented to the panel was how to make money off social networks. Underlying this question, and at least once stated explicitly, was the issue of whether social networks really matter or are just the latest flash in the pan, the latest bubble, and soon to fade if people find the riches they deliver fall short of their absurdly hyped expectations.

The best answer to this question came only at the very end, sneaking in when the session was already over, from panel chair Andrew Tompkins. Andrew said he thought most of the money would most likely be made from understanding an individual in a network and their immediate neighborhood. He expected that, as the space matured, the more theoretical work we see now, which looks at global patterns and trends across different types of social networks, would continue, but the emphasis would shift to understanding each person in the network, using their behavior and their immediate neighbors to help people find and discover things they want.

Another question that came up and went largely unanswered was, are social networks really application specific? That is, are the patterns we see in one network (e.g. mobile) distinct from what we see in another (e.g. LinkedIn) because people use these tools differently? There was some evidence for this in one of the conference papers where there was a rather unusual distribution (PDF) reported in the contact lists of Sprint mobile customers because of how people use their mobile devices. The general issue here is that there is a different meaning of a link between people -- A friend? A colleague? Someone I e-mailed once? -- in each of these applications, so it is not clear that conclusions we draw from analyzing one social network generalize to others.

And this leads to another question, are social networks just a feature? That is, are social networks just a part of an application, not something worth anything on its own? For example, is our e-mail contact list only important in the context of e-mail? Is Facebook just a tool for communication, like IM, but not really best described as a social network? This came up a bit during the Q&A with the panel when someone suggested that perhaps we were just using one hammer -- graph analysis -- on every problem when the graph might not actually best capture what is important, the underlying activity in the application.

Though it was not discussed at all during the panel, two of the panelists focused their work on privacy, which made me wish I had asked, do we really have firm evidence that people care about privacy in these social applications? Instead, it seems people say they care about privacy but then freely give away private information when asked ([1]). From what I can tell, we do not really know to what extent privacy really matters to social networking application users, do we?

In the end, it seems many questions remain, some of them quite basic, about where social networks are going. At this point, it is not even quite clear we know precisely what social networks are, much less whether they have an independent future.

Saturday, August 30, 2008

KDD talk on the Future of Image Search

Jitendra Malik from UC Berkeley gave an enjoyable and often quite amusing invited talk at KDD 2008 on "The Future of Image Search" where he argued that "shape-based object recognition is the key."

The talk started with Jitendra saying image search does not work well. To back this claim, he showed screen shots of searches on Google Image Search and Flickr for [monkey] and highlighted the false positives.

Jitendra then claimed that neither better analysis of the text around images nor more tagging will solve this problem because these techniques are missing the semantic component of images, that is, the shapes, concepts, and categories represented by the objects in the image.

Arguing that we need to move "from pixels to perception", Jitendra pushed for "category recognition" for objects in images where "objects have parts" and form a "partonomy", a hierarchical grouping of object parts. He said current attempts at this have at most 100 parts in the partonomy, but it appears humans commonly use 2-3 orders of magnitude more, 30k+ parts, when doing object recognition.

A common theme running through the talk was having a baseline model of a part, then being able to deform the part to match most variations. For example, he showed some famous examples from the 1917 text "On Growth and Form", then talked about how to do these transformations to find the closest matching generic part for objects in an image.

Near the end of the talk, Jitendra contrasted the techniques used for face detection, which he called "a solved problem", with the techniques he thought would be necessary for general object and part recognition. He argued that the techniques used for face detection, sliding various sized windows across the image looking for things that have a face pattern to them, would both be too computationally expensive and have too high a false positive rate to do for 30k objects/parts. Jitendra said something new would have to be created to deal with large scale object recognition.
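As a rough illustration of the computational cost, here is a back-of-the-envelope sketch in Python (the image size, window sizes, and stride are made-up numbers); the number of classifier evaluations grows linearly with the number of object categories, which hurts badly at 30k categories.

# Back-of-the-envelope cost of sliding-window detection. All numbers
# here (image size, scales, stride) are invented for illustration; the
# point is only that evaluations scale linearly with the number of
# object categories.

def num_window_evaluations(img_w, img_h, window_sizes, stride, num_categories):
    windows = 0
    for w, h in window_sizes:
        cols = max(0, (img_w - w) // stride + 1)
        rows = max(0, (img_h - h) // stride + 1)
        windows += cols * rows
    return windows * num_categories

scales = [(24, 24), (48, 48), (96, 96), (192, 192)]
print(num_window_evaluations(640, 480, scales, stride=4, num_categories=1))
print(num_window_evaluations(640, 480, scales, stride=4, num_categories=30000))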

What that something new is remains unclear, but Jitendra seemed to me to be pushing for a biologically inspired system, perhaps one that quickly comes up with rough candidate interpretations of parts of the image, discounts interpretations that violate prior knowledge about which objects tend to occur near each other, then repeats at additional levels of detail until a steady state is reached.

Doing object recognition quickly, reliably, and effectively to find meaning in images remains a big, hairy, unsolved research problem, and probably will be for some time, but, if Jitendra is correct, it is the only way to make significant progress in image search.

Friday, August 22, 2008

Column versus row stores

Daniel Abadi, Samuel Madden, and Nabil Hachem had a paper at SIGMOD 2008, "Column-Stores vs. Row-Stores: How Different Are They Really?" (PDF), with a fascinating discussion of what optimizations could be implemented in a traditional row store database to make it behave like a column store.

The paper attempts to answer the question of whether there "is something fundamental about the way column-oriented DBMSs are internally architected" that yields performance gains or whether the column-oriented optimizations can be mimicked in a row-store design.

Specifically, the authors look at taking a row-store and vertically partitioning the rows, indexing all the columns, creating materialized views of the data (just the needed columns for a query already precomputed), compressing the data, and a couple variations on late materialization when joining columns that significantly impact performance.
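As a toy illustration of why the column layout helps scan-heavy queries (my example, not the paper's), compare summing one column out of a wide table under a row layout versus a vertically partitioned layout:

# Toy illustration (not from the paper) of why column layouts help
# scan-heavy queries: summing one column out of a wide table touches
# far less data when the column is stored contiguously.

rows = [  # row-store layout: each record stored together
    {"id": i, "price": i * 0.5, "qty": i % 7, "note": "x" * 100}
    for i in range(1000)
]

columns = {  # column-store layout: one array per column
    "id":    [r["id"] for r in rows],
    "price": [r["price"] for r in rows],
    "qty":   [r["qty"] for r in rows],
    "note":  [r["note"] for r in rows],
}

# The row store must walk every (wide) record to read one field ...
total_row = sum(r["price"] for r in rows)
# ... while the column store reads only the "price" array.
total_col = sum(columns["price"])
assert total_row == total_col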

Despite strong statements in the abstract and introduction that "it is not possible for a row-store to achieve some of the performance advantages of a column-store" and that "there is in fact something fundamental about the design of column-store systems that makes them better suited to data-warehousing workloads", the discussion and conclusion further into the paper are far murkier. For example, the conclusion states that it "is not that simulating a column-store in a row-store is impossible", just that attempting to fully simulate a column-store is difficult in "today's row store systems".

The murkiness seems to come from the difficulty the authors had forcing a row-store to do some of the query plan optimizations a column-store does once they had jiggered all the underlying data to simulate a column-store.

For example, when discussing "index-only plans" (which basically means throwing indexes on all the columns of a row-store), the row-store used slow hash-joins to combine indexed columns and they "couldn't find a way to force [the system] to defer these joins until later in the plan, which would have made the performance of this approach closer to vertical partitioning".

Later, the authors say that they couldn't "trick" the row-store into storing columns "in sorted order and then using a merge join", so it instead did "relatively expensive hash joins". They say the problems are not fundamental and that "there is no reason why a row-store cannot store tuple headers separately, use virtual record-ids to join data, and maintain heap files in guaranteed position order", just that they were not able to force the row-store to do these things in their simulation.

In the end, the paper only concludes that:
A successful column-oriented simulation [in a row-store] will require some important system improvements, such as virtual record-ids, reduced tuple overhead, fast merge joins of sorted data, run-length encoding across multiple tuples ... operating directly on compressed data ... and late materialization ... some of ... [which] have been implemented or proposed to be implemented in various row-stores.
That seems like a much weaker conclusion than what the introduction promised. The conclusion appears to be that a row-store can perform as well as a column-store on data warehouse workloads, just that it looks awkward and difficult to make all the optimizations necessary for it to do so.

Please see also two blog posts, "Debunking a Myth: Column-Stores vs. Indexes" and "Debunking Another Myth: Column-Stores vs. Vertical Partitioning", by the first author, Daniel Abadi, where he expands on some parts of the paper.

If you are interested in this topic, you might also enjoy two of my older posts, "MapReduce a step backwards?" and "C-Store and Google BigTable".

Going to KDD 2008

I'll be at KDD 2008 next week. If you make it there, please say hello!

Friday, August 15, 2008

Cassandra data store at Facebook

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented a talk at SIGMOD 2008, "Cassandra: Structured Storage System over a P2P Network", that describes a Google Bigtable-like data store used actively by Facebook.

What strikes me as remarkable about the system is its rather casual treatment of writes. As far as I can tell, a write is only sent to memory on one box, not written to disk, not even written to multiple replicas in memory. That seems fine for log or click data, but, for the kind of data Facebook deals with, it seems a little surprising to not see a requirement for multiple replicas to get the write in memory before the app is told that the write succeeded.

The code for Cassandra is open source and there is a wiki that adds a few tidbits to the SIGMOD slides. Note that HBase is also open source and also is modeled after Google's Bigtable; HBase is layered on top of Hadoop and sponsored heavily by Yahoo.

Please see also James Hamilton, who posts a summary of the slides and brief commentary, and Dare Obasanjo, who offers more detailed commentary on Cassandra.

Please see also my post, "Highly available distributed hash storage from Amazon", about Amazon's Dynamo (PDF). There are similarities in the design and references in the wiki that suggest that Facebook's Cassandra was influenced by Amazon's Dynamo.

If you are interested in all things Bigtable, you might also enjoy Phil Bernstein's post, "Google Megastore", that summarizes a Google talk at SIGMOD 2008 on a storage system they built on top of Bigtable that adds transactions and additional indexes, among other things.

Update: Avinash Lakshman swings by in the comments and clarifies that Cassandra does have a commit log, so writes do go to disk on at least one machine immediately. A write first updates the commit log, then updates the memory tables, and, finally, in batch some time later, goes out to all the tables on disk.
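As a minimal sketch of that write path (illustrative only, nothing here comes from Cassandra's actual code or API), the flow looks something like this:

# Minimal sketch of the write path described above: append to an
# on-disk commit log, update the in-memory table, and flush to the
# on-disk tables later in batch. Purely illustrative; none of these
# names or structures are Cassandra's.

class TinyStore:
    def __init__(self, log_path):
        self.log = open(log_path, "a")   # commit log on disk
        self.memtable = {}               # in-memory table
        self.sstables = []               # batched, immutable on-disk tables

    def write(self, key, value):
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()                 # durable before acknowledging
        self.memtable[key] = value       # then update memory
        return "ok"                      # ack the caller

    def flush_memtable(self):            # run later, in batch
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()

store = TinyStore("tinystore.log")
store.write("user:42", "profile-blob")
store.flush_memtable()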

Y Combinator's list of startup ideas

Paul Graham at Y Combinator posts an interesting list of "Startup Ideas We'd Like to Fund".

My favorites are Fix Advertising, Fixing E-mail Overload, and New News.

Found via John Cook, who notes that many are working on most of these ideas, but "ideas are a dime a dozen [and] it all comes down to execution." Can't argue with that.

Please see also "What to advertise when there is no commercial intent?", "Attention and Life Hacking", and Findory.

Thursday, August 07, 2008

Clever method of near duplicate detection

Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University have a cute idea in their SIGIR 2008 paper, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They focus near duplicate detection on the important parts of a web page by using the next few words after a stop word as a signature.

An extended excerpt:
The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements. The branding elements tend to be deliberately replicated across the pages of individual sites, creating a "noise level" of similarity among pages of the same site.

SpotSigs provides a robust scheme for extracting characteristic signatures from Web documents, thus aiming to filter natural-language text passages out of noisy Web page components ... [It] is much more efficient, easier to implement, and less error-prone than ... layout analysis .... [and] we are able to show very competitive runtimes to the fastest known but more error-prone schemes like I-Match and Locality Sensitive Hashing (LSH).

The points (or "spots") in the page at which spot signatures are generated are all the locations where ... anchor words [occur]. We call the anchor words antecedents, which are typically chosen to be frequent within the corpus. The most obvious, largely domain-independent choices ... are stopwords like is, the, do, have, etc. ... which are distributed widely and uniformly within any snippet of natural-language text.

A spot signature ... consists of a chain of words that follow an antecedent word .... Spot signatures ... [focus on] the natural-language passages of Web documents and skip over advertisements, banners, and navigational components.
The paper gives an example of generating a spot signature where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".
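As a quick sketch of the idea (my simplified guess at the configuration, not the paper's exact antecedent set, stopword list, or chain length), generating spot signatures might look like this in Python:

# Sketch of spot signature extraction: for each occurrence of an
# anchor ("antecedent") stopword, chain together the next few
# non-stopword tokens. The antecedent set, chain length, and
# tokenization here are simplified guesses.

ANTECEDENTS = {"a"}                      # could also include "the", "is", ...
STOPWORDS = {"a", "an", "the", "is", "to", "at", "of", "off", "and"}
CHAIN_LEN = 2

def spot_signatures(text):
    tokens = text.lower().split()
    sigs = []
    for i, tok in enumerate(tokens):
        if tok not in ANTECEDENTS:
            continue
        chain = [w for w in tokens[i + 1:] if w not in STOPWORDS][:CHAIN_LEN]
        if len(chain) == CHAIN_LEN:
            sigs.append(":".join([tok] + chain))
    return sigs

print(spot_signatures("at a rally to kick off a weeklong campaign"))
# ['a:rally:kick', 'a:weeklong:campaign']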

What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution. Rather than taking on the brutal task of tearing apart the page layout to discard ads, navigation, and other goo, they just noted that the most important part of the page tends to be natural language text. By starting all their signatures at stopwords, they naturally focus the algorithm on the important parts of the page. Very cool.

If you can't get enough of this topic, please see also my previous post, "Detecting near duplicates in big data", that discusses a couple papers out of Google that look at this problem.

Update: Martin Theobald posted a detailed summary of the paper on the Stanford InfoBlog.

Wednesday, August 06, 2008

BrowseRank: Ranking pages by how people use them

Liu et al. from Microsoft Research Asia had the best student paper at SIGIR 2008, "BrowseRank: Letting Web Users Vote for Page Importance" (PDF), that builds a "user browsing graph" of web pages where "edges represent real transitions between web pages by users."

An excerpt:
The user browsing graph can more precisely represent the web surfer's random walk process, and thus is more useful for calculating page importance. The more visits of the page made by users and the longer time periods spent by the users on the page, the more likely the page is important ... We can leverage hundreds of millions of users' implicit voting on page importance.

Some websites like adobe.com are ranked very high by PageRank ... [because] Adobe.com has millions of inlinks for Acrobat Reader and Flash Player downloads. However, web users do not really visit such websites very frequently and they should not be regarded [as] more important than the websites on which users spend much more time (like myspace.com and facebook.com).

BrowseRank can successfully push many spam websites into the tail buckets, and the number of spam websites in the top buckets in BrowseRank is smaller than PageRank or TrustRank.

Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks.
One issue that came up, in discussions afterward about the paper, is whether BrowseRank gets something much different than a smoothed version of visit counts.

Let's say all the transition probabilities between web pages are set by how people actually move across the web, all the starting points on the web are set by where people actually start, and then you simulate random walks. If all this is done correctly, your random walkers should move like people do on the web and visit where they visit on the web. So, after all that simulating, it seems like what you get should be quite close to visit counts.
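Here is a tiny simulation of that argument, on a made-up browsing graph; with start pages and transition probabilities taken from observed behavior, the walkers' visit frequencies come out close to the observed visit counts:

# Tiny simulation of the argument above, on a made-up browsing graph.
# Start pages and transition probabilities are drawn from
# (hypothetical) observed behavior, so the random walkers' visit
# frequencies end up tracking the observed visit counts.

import random
from collections import Counter

starts = ["home", "home", "news", "mail"]          # observed session starts
transitions = {                                    # observed click-throughs
    "home": ["news", "news", "mail", None],        # None = session ends
    "news": ["article", "article", None],
    "mail": [None],
    "article": ["news", None],
}

visits = Counter()
random.seed(0)
for _ in range(100000):
    page = random.choice(starts)
    while page is not None:
        visits[page] += 1
        page = random.choice(transitions[page])

print(visits.most_common())   # approximates the observed visit counts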

This is a bit of an oversimplification. BrowseRank uses time spent on a page in addition to visit counts and its Markov model of user behavior ends up smoothing the raw visit count data. Nevertheless, the paper does not compare BrowseRank to a simple ranking based on visit counts, so the question still appears to be open.

Please see also my earlier post, "Google Toolbar data and the actual surfer model".

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a WSDM 2008 paper that looked at supplementing PageRank with traffic data.

Tuesday, August 05, 2008

Caching, index pruning, and the query stream

A SIGIR 2008 paper out of Yahoo Research, "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines" (ACM page), looks at how performance optimizations to a search engine can impact each other. In particular, it looks at how caching the results of search queries impacts the query load that needs to be served from the search index, and therefore changes the effectiveness of index pruning, which attempts to serve some queries out of a reduced index.

An extended excerpt:
Typically, search engines use caching to store previously computed query results, or to speed up the access to posting lists of popular terms.

[A] results cache is a fixed-capacity temporary storage of previously computed top-k query results. It returns the stored top-k results if a query is a hit or reports a miss otherwise.

Results caching is an attractive technique because there are efficient implementations, it enables good hit rates, and it can process queries in constant time. ... One important issue with caches of query results is that their hit rates are bounded ... [by] the large fraction of infrequent and singleton queries, even very large caches cannot achieve hit rates beyond 50-70%, independent of the cache size.

To overcome this limitation, a system can make use of posting list caching or/and employ a pruned version of the index, which is typically much smaller than the full index and therefore requires fewer resources to be implemented. Without affecting the quality of query results, such a static pruned index is capable of processing a certain fraction of queries thus further decreasing the query rate that reaches the main index.

[A] pruned index is a smaller version of the main index, which is stored on a separate set of servers. Such a static pruned index resolves a query and returns the response that is equivalent to what the full index would produce or reports a miss otherwise. We consider pruned index organizations that contain either shorter lists (document pruning), fewer lists (term pruning), or combine both techniques. Thus, the pruned index is typically much smaller than the main index.

The original stream of queries contains a large fraction of non-discriminative queries that usually consist of few frequent terms (e.g., navigational queries). Since frequent terms tend to be popular, those queries are likely to repeat in the query log and therefore are typical hits in the results cache. At the same time, these queries would be good candidates to be processed with the document pruned index due to their large result set size. Therefore, document pruning does not perform well anymore when the results cache is included in the architecture.

We observed that the results cache significantly changes the characteristics of the query stream: the queries that are misses in the results cache match approximately two orders of magnitude fewer documents on average than the original queries. However, the results cache has little effect on the distribution of query terms.

These observations have important implications for implementations of index pruning: the results cache only slightly affects the performance of term pruning, while document pruning becomes less effective, because it targets the same queries that are already handled by the results cache.
It is an excellent point that one optimization can impact another, so analyzing performance and configuration twiddles in isolation from the other optimizations is likely to give incorrect results. The results in this paper could be problematic for past work on index pruning since they indicate that the query streams used in their evaluations may not be representative of reality.
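To make the interaction concrete, here is a toy sketch of the serving pipeline the paper considers -- results cache in front, pruned index next, full index as the fallback -- with stand-in objects rather than a real search engine:

# Toy sketch of the serving pipeline: a results cache in front, a
# pruned index behind it, and the full index as the fallback. The
# point above is that the cache absorbs exactly the popular, broad
# queries the document-pruned index was best at, so the pruned index
# sees a much harder residual query stream. Everything here is a
# stand-in, not a real search engine.

class PrunedIndex:
    def __init__(self, answers):
        self.answers = answers            # only covers some queries
    def try_answer(self, query):
        return self.answers.get(query)    # None = miss, fall through

def serve(query, cache, pruned, full_answer):
    if query in cache:
        return cache[query]               # results cache hit
    results = pruned.try_answer(query)    # pruned index may miss
    if results is None:
        results = full_answer(query)      # full index always answers
    cache[query] = results
    return results

cache = {}
pruned = PrunedIndex({"weather": ["weather.example.com"]})
full = lambda q: [f"result-for-{q}"]
print(serve("weather", cache, pruned, full))          # answered by the pruned index
print(serve("rare long query", cache, pruned, full))  # falls through to the full index
print(serve("rare long query", cache, pruned, full))  # now a cache hit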

Monday, August 04, 2008

To personalize or not to personalize

Jaime Teevan, Sue Dumais, and Dan Liebling had a paper at SIGIR 2008, "To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent" (PDF), that looks at how and when different people want different things from the same search query.

An excerpt:
For some queries, everyone ... is looking for the same thing. For other queries, different people want very different results.

We characterize queries using features of the query, the results returned for the query, and people's interaction history with the query ... Using these features we build predictive models to identify queries that can benefit from personalized ranking.

We found that several click-based measures (click entropy and potential for personalization curves) reliably indicate when different people will find different results relevant for the same query .... We [also] found that features of the query string alone were able to help us predict variation in clicks.
Click entropy is a measure of variation in the clicks on the search results. Potential for personalization is a measure of how well any one ordering of results can match the ordering each individual searcher would most prefer. The query features that worked the best for predicting ambiguity of the query were query length, the number of query suggestions offered for the query, and whether the query contained a url fragment.
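As a quick illustration of click entropy (with made-up click counts, not data from the paper):

# Click entropy on made-up click counts. A query where nearly everyone
# clicks the same result has entropy near zero; a query whose clicks
# spread across many results has high entropy and is a better
# candidate for personalization.

import math

def click_entropy(click_counts):
    total = sum(click_counts)
    probs = [c / total for c in click_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(click_entropy([980, 10, 10]))              # unambiguous query, ~0.16 bits
print(click_entropy([300, 250, 200, 150, 100]))  # ambiguous query, ~2.2 bits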

One tidbit I found interesting in the paper was that average click position alone was a strong predictor of ambiguity of the query and was well correlated with both click entropy and potential for personalization. That's convenient. Average click position seems like it should be much easier to calculate accurately when facing sparse data.

Please see also my older posts, "Effectiveness of personalized search", "Potential of web search personalization", and "Characterizing the value of personalized search".

Friday, August 01, 2008

Modeling how searchers look at search results

Georges Dupret and Benjamin Piwowarski from Yahoo Research had a great paper at SIGIR 2008, "A User Browsing Model to Predict Search Engine Click Data from Past Observations" (ACM page). It nicely extends earlier models of searcher behavior to allow for people skipping results after hitting a lot of uninteresting results.

An extended excerpt:
Click data seems the perfect source of information when deciding which documents (or ads) to show in answer to a query. It can be thought of as the results of users voting in favor of the documents they find interesting.

Nevertheless, [click data] cannot be used without further processing: A fundamental problem is the position bias. The probability of a document being clicked depends not only on its relevance, but on other factors as its position on the result page ... Eye-tracking experiments show that a user is less likely to examine results near the bottom of the list ... [and] that a document is not clicked with the same frequency if situated after a highly relevant or mediocre document.

The Cascade Model assumes that users view search results from top to bottom, deciding whether to click each result before moving to the next ... The cascade model is based on a simplistic behavior model: Users examine all the documents sequentially until they find a relevant document and then abandon the search.

We generalize this and allow for the possibility that a user skips a document without examining it .... We propose that the probability of examination is dependent on the distance d from the last click as well as on the position r in the ranking. The intuition behind using the distance is that a user tends to abandon the search after seeing a long sequence of unattractive snippets on the page.

Our solution outperforms very significantly all previous models ... Our findings confirm that [users] almost always see the document directly after the clicked document ... and explain why documents situated just after a very relevant document are clicked more often.
The paper also discusses a "multiple browsing model" that attempts to capture that different people browse differently on different types of queries (e.g. navigational vs. informational), but that model, surprisingly, failed to yield an improvement.
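As a simplified sketch of that kind of examination model (the parameters and decay functions here are invented, not the paper's exact parameterization):

# Simplified sketch of the browsing model described above, with
# invented parameters: the probability of examining a result decays
# with both its rank and its distance since the last click, and a
# click happens only if the result is examined and its snippet is
# attractive. An illustration of the idea, not the paper's model.

import random

def simulate_session(attractiveness, decay_rank=0.95, decay_dist=0.7):
    clicks = []
    last_click = 0
    for rank, attr in enumerate(attractiveness, start=1):
        dist = rank - last_click
        p_examine = (decay_rank ** rank) * (decay_dist ** (dist - 1))
        if random.random() < p_examine and random.random() < attr:
            clicks.append(rank)
            last_click = rank
    return clicks

random.seed(1)
# attractive snippets at ranks 1 and 4, mediocre ones elsewhere
print([simulate_session([0.8, 0.2, 0.1, 0.7, 0.1]) for _ in range(5)])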

Please see also Craswell et al., "An Experimental Comparison of Click-Position Bias Models" (PDF), a fun WSDM 2008 paper that proposes a Cascade Model for how searchers review search results.

Wednesday, July 30, 2008

Easy processing of massive data sets

Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou have an upcoming paper at VLDB 2008, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets" (PDF), that describes a parallel data processing tool "being used daily ... inside Microsoft" on "large clusters ... of thousands of commodity servers" over "petabytes of data".

Scope is similar to Yahoo's Pig, which is a higher level language on top of Hadoop, or Google's Sawzall, which is a higher level language on top of MapReduce. But, where Pig focuses on and advocates a more imperative programming style, Scope looks much more like SQL.

Some excerpts from the paper:
Internet companies store and analyze massive data sets, such as search logs, web content collected by crawlers, and click streams collected from a variety of web services .... Several companies have developed distributed data storage and processing systems on large clusters of shared nothing commodity servers including Google's File System, Bigtable, MapReduce, Hadoop, Yahoo!'s Pig system, Ask.com's Neptune, and Microsoft's Dryad.

We present a new scripting language, SCOPE. Users familiar with SQL require little or no training to use SCOPE. Like SQL, data is modeled as a set of rows composed of typed columns. Every rowset has a well-defined schema ... It allows users to focus on the data transformations required to solve the problem at hand and hides the complexity of the underlying platform and implementation details.

SCOPE [also] is highly extensible ... users can easily define their own ... operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs) ... [which] allows users to solve problems that cannot easily be expressed in traditional SQL.
The paper goes on to provide many examples of code written in Scope, describe the underlying Cosmos append-only distributed file system, discuss query plans for executing Scope programs, and touch on some of Scope's compile and run-time optimizations.
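The examples in the paper are in SCOPE itself; as a rough Python analogue of the kind of extract, filter, and group-by-aggregate pipeline such a script expresses (this is not SCOPE syntax, and the log format and field names are made up):

# Not SCOPE syntax -- just a rough Python analogue of the kind of
# pipeline such a script expresses: a user-defined extractor parses
# rows out of a raw log, then the query filters, groups, and counts.
# The log format and field names are invented.

from collections import Counter

def extract(lines):                 # plays the role of a user-defined extractor
    for line in lines:
        date, query, clicks = line.rstrip("\n").split("\t")
        yield {"query": query, "clicks": int(clicks)}

def top_queries(lines, min_clicks=1, k=10):
    counts = Counter()
    for row in extract(lines):                       # EXTRACT
        if row["clicks"] >= min_clicks:              # WHERE
            counts[row["query"]] += row["clicks"]    # GROUP BY ... SUM
    return counts.most_common(k)                     # ORDER BY ... TOP k

log = ["2008-07-30\tkdd 2008\t3", "2008-07-30\tsigir\t1", "2008-07-31\tkdd 2008\t2"]
print(top_queries(log))   # [('kdd 2008', 5), ('sigir', 1)]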

Though I probably shouldn't, let me add that, in my experience, Scope is great fun. It's like all the data and computation I can eat. And I can eat a lot.

Please see also my past posts on related work, including "Yahoo, Hadoop, and Pig Latin", "Sample programs in DryadLINQ", "Automatic optimization on large Hadoop clusters", and "Yahoo Pig and Google Sawzall".

Please see also my Dec 2005 post, "Making the impossible possible", which quotes Nobel Prize winner Tjalling Koopmans as saying, "Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head."

Monday, July 21, 2008

Kai-Fu Lee keynote at SIGIR

Googler Kai-Fu Lee gave a keynote talk yesterday at SIGIR 2008 on "The Google China Experience".

The Google China experience has been fraught with difficulties. As Kai-Fu said, Google was expecting that their search engine could succeed in China with the same interface and approach that had worked so well elsewhere, but was "humbled" by the lack of uptake.

Why did a Google that had worked well elsewhere not succeed in China?

Kai-Fu argued that two unusual features of Chinese internet users, their youth and their language, appear to explain the difference.

While showing examples of sites that did succeed in China, Kai-Fu argued that the young, novice, hurried average Chinese user is looking less for navigation or one authoritative source of information and more for entertainment and broad surveys of information.

Google China was optimized for finding the one site you need to go to, as it is elsewhere, but, Kai-Fu said, according to eyetracking studies and log data, Chinese users tend to be much less task-oriented, read much more of the page, and click many more links than US users.

So, Google started to offer up more opportunities for exploration and discovery -- by being much more aggressive with query suggestions and universal search, for example, and by adding browse links to the Google front page rather than keeping a stark page with just a search box -- instead of only trying to give a quick answer and let people move on.

Kai-Fu suggested that the Chinese interest in browsing and exploration also may have roots in the Chinese language itself. Chinese is slower and more work to type than other languages, but more compact to read. This seems to cause a preference for clicking over typing and a preference for pages dense with information over the sparsity Google tended to favor in other countries.

One curious question that Kai-Fu raised was whether these preferences will remain true over time. Expert internet users tend to be more task-oriented than novice users. Google China has had much more success in gaining market share in China among expert users.

It may be the case that, as people gain more experience with the Web and what is available on the Web, their behavior shifts from browse to search, from exploration of what is out there to finding the information they know must exist.

Update: Paul Heymann at Stanford Infolab posts an excellent summary of Kai-Fu Lee's talk, as well as notes on other talks at SIGIR.

Tuesday, July 15, 2008

Learning diversity when learning to rank

Filip Radlinski, Robert Kleinberg, and Thorsten Joachims have an ICML 2008 paper, "Learning Diverse Rankings with Multi-Armed Bandits" (PDF), that attempts to "directly learn a diverse ranking of documents based on users' clicking behavior."

An excerpt:
[We] show how clickthrough data can be used to learn rankings maximizing the probability that any new user will find at least one relevant document high in the ranking .... [even though] web queries often have different meanings for different users.

We propose an online learning approach for learning from usage data. As training data is being collected, it immediately impacts the rankings shown ... The goal is to minimize the total number of poor rankings displayed over all time.
The work appears to be largely theoretical due to very long convergence times -- sadly, investigating "how prior knowledge can be incorporated ... to improve the speed of convergence" is left to future work -- but still is a worthwhile and enjoyable read.
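As a rough sketch of the general idea as I understand it -- one bandit per rank position, with a position rewarded only when its own pick gets the first click -- here is a toy simulation (epsilon-greedy is a stand-in for the paper's bandit algorithm, and the documents and simulated users are made up):

# Toy sketch of learning a diverse ranking with one simple bandit per
# rank position: each position's bandit picks a document, and a
# position is rewarded only when its own pick got the first click.
# Epsilon-greedy stands in for the paper's bandit algorithm; the
# documents and user intents are invented.

import random

DOCS = ["d1", "d2", "d3", "d4"]
K = 2                                      # number of rank positions to learn
USER_PREFS = [{"d1"}] * 6 + [{"d3"}] * 4   # 60% of users want d1, 40% want d3

class EpsilonGreedy:
    def __init__(self, eps=0.1):
        self.eps, self.counts, self.rewards = eps, {}, {}
    def pick(self, choices):
        if random.random() < self.eps or not self.counts:
            return random.choice(choices)
        return max(choices, key=lambda d: self.rewards.get(d, 0.0) / self.counts.get(d, 1))
    def update(self, doc, reward):
        self.counts[doc] = self.counts.get(doc, 0) + 1
        self.rewards[doc] = self.rewards.get(doc, 0.0) + reward

random.seed(0)
bandits = [EpsilonGreedy() for _ in range(K)]
for _ in range(20000):
    ranking, picks = [], []
    for b in bandits:
        doc = b.pick(DOCS)                 # each position's bandit picks a doc
        picks.append(doc)
        if doc in ranking:                 # already shown above: show a filler instead
            doc = random.choice([d for d in DOCS if d not in ranking])
        ranking.append(doc)
    wanted = random.choice(USER_PREFS)     # a user clicks the first wanted doc, if any
    first_click = next((i for i, d in enumerate(ranking) if d in wanted), None)
    for i, b in enumerate(bandits):        # reward only the position whose own pick was clicked
        got_click = (i == first_click) and (ranking[i] == picks[i])
        b.update(picks[i], 1 if got_click else 0)

best = [max(DOCS, key=lambda d: b.rewards.get(d, 0.0) / b.counts.get(d, 1)) for b in bandits]
print(best)   # likely ['d1', 'd3']: the two positions learn to cover both intents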

Please see my previous post, "Actively learning to rank", that discusses a fun KDD 2007 paper also by Filip Radlinski and Thorsten Joachims on learning to rank from click data.

Update: Two years later, Filip publishes a paper that appears to have a more practical and scalable technique for learning diversity from click data, "Learning optimally diverse rankings over large document collections". More nice work there from Filip.

Automatic optimization on large Hadoop clusters

Chris Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava at Yahoo Research had a USENIX 2008 paper, "Automatic Optimization of Parallel Dataflow Programs" (PDF), that looks at optimizations for large-scale Hadoop clusters using Pig.

The paper says that it is only attempting "to suggest some jumping-off points for [optimization] work." With much of the paper spending a fair amount of time on "textbook ... optimizations" such as early projection and filtering, operator rewrites, and physical execution plans for joins, parts do read like a survey.

The part that looks at work sharing between jobs, however, goes beyond a survey and is quite interesting. I was particularly intrigued by the discussion of materialized views, which the authors suggest could be used to cache the results of common operations. One common early computation in these clusters is to extract a few fields from a file and then sort the extract based on one or more of the fields before doing additional processing. If that were cached, this could look a bit like lazily creating indexes over the data the first time they are needed. A fun idea, and one that could narrow the gap between Map-Reduce databases and more traditional databases.
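As a toy sketch of that work-sharing idea (the log format and field names are invented), the first job to need a particular extract-and-sort pays for it, and later jobs reuse the cached result:

# Toy sketch of the work-sharing idea above. The first job that needs
# a given extract-and-sort pays for it; later jobs reuse the cached
# result, which starts to look like lazily building an index over the
# raw files. Log format and field names are invented.

_view_cache = {}

def extract_and_sort(lines, fields, sort_field):
    key = (tuple(lines), tuple(fields), sort_field)
    if key not in _view_cache:                      # only computed once
        rows = []
        for line in lines:
            cols = line.rstrip("\n").split("\t")
            rows.append({f: cols[i] for i, f in enumerate(fields)})
        _view_cache[key] = sorted(rows, key=lambda r: r[sort_field])
    return _view_cache[key]

raw = ["2008-08-01\tquery-a\tuser7", "2008-08-01\tquery-b\tuser3"]
fields = ["date", "query", "user"]
job1 = extract_and_sort(raw, fields, "user")   # computed
job2 = extract_and_sort(raw, fields, "user")   # served from the cached "view"
assert job1 is job2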

One other interesting tidbit in the paper is that the authors argue that high level languages like Pig for programming large-scale parallel computations will become dominant over "direct Map-Reduce or Dryad programs" in a "mass migration" as "good optimization technology" is implemented, "analogous to the migration from assembly to C-style languages to Java-style languages."

Please see also a paper by some of the same authors, "Parallel Evaluation of Composite Aggregate Queries" (PDF). That paper looks at reducing expensive data combining operations in Hadoop clusters by implementing a "cross-node data redistribution strategy that takes into account the nested structure of a query ... such that aggregations can be generated locally" for a type of aggregation query.

Please see also my earlier post, "Hadoop and scheduling", that discusses another paper by some of the same authors and its look at a method of combining many jobs on a Hadoop cluster that all want to access the same raw data files.

Social peer-to-peer and Tribler

I finally got to a fun paper I have been meaning to read for some time, "Tribler: A social-based peer-to-peer system" (PDF).

What is interesting about the paper is that it proposes a combination of social networking and P2P for content sharing that not only is designed to produce recommendations from friends and users with similar tastes in the network, but also uses those friend relationships to speed downloads, reduce free-riding, and encourage good behavior.

An excerpt:
Most current P2P file-sharing systems treat their users as anonymous, unrelated entities ... [Tribler is] a novel social-based P2P file-sharing paradigm that exploits social phenomena by maintaining social networks and using these in content discovery, content recommendations, and downloading.

Tribler performs content discovery and recommendation based on the notion of taste buddies, that is, users with the same tastes or interests .... The ... interface facilitate[s] the formation of social groups ... [which] may help reduce anti-social behavior.

A user invokes the help of his friends to speed up downloads ... Peers contribute their bandwidth by joining a swarm even if they are not interested in the content being distributed in this swarm ... [Past work assumed] sufficient altruism ... [which] makes ... [it] impractical ... Our [approach] solves this problem by introducing social-group incentives.
Tribler is an open source project. More information at tribler.org, especially on their research page.

Monday, July 14, 2008

Going to SIGIR

I will be at SIGIR 2008 next week in Singapore. If you make it to what is very nearly the antipode for those of us in the US, please say hello!

Monday, July 07, 2008

Google Toolbar data and the actual surfer model

There were a few interesting developments in the past couple weeks from Google that appear to relate to their ability to track much of the movements on the web.

As MG Siegler from VentureBeat noted, among many others, Google Trends now offers a feature that shows the traffic many websites get, much like Alexa does using data from the Alexa toolbar. The new functionality also shows similar pages and queries, people who visited site X also visited Y and people who visited site X also searched for Y.

Danny Sullivan, when talking about similar features Google launched in a new tool called Google Ad Planner, wrote:
I specifically asked ... [if] the [Google] toolbar is NOT [being used], and [a] "secret sauce" reply ... is all I got.

That makes me think that toolbar data IS being used. In particular, the focus on Google Analytics data feels like a sideshow. Google can't rely on Google Analytics as a core data source for this information, because of the simple reason that not every site runs it. In contrast, using Google Toolbar data would give them a nearly complete sample of all sites out there.
Erick Schonfeld at TechCrunch followed up with a post, "Is Google Ad Planner Getting Its Data From The Google Toolbar?".

There isn't much new about using this data in the way Google Trends has revealed. Alexa and others have been doing it with their toolbars for many years. What is new is that Google Toolbar is installed much more widely, including on every Dell computer and in every installation of Adobe Flash.

I have no information about whether Google is using their toolbar data, but I have a hard time believing they could resist it. Not only can it be used for things like Google Trends, but it could have a dramatic impact in core web search.

PageRank, the original core of Google's search relevance, is an analysis of the links between websites that simulates someone randomly surfing across the links, the so-called random surfer model.

With toolbar data, you no longer need this approximation of a random surfer model. You have the actual surfer model.

You know exactly how people move across the web. You know which sites are popular and which sites are never visited. You know which links are traversed and how often. You know everything about where people go and what people want.

Data like that should allow tremendous advances in relevance. It is hard for me to believe that Google would not be using their toolbar data, not just for Google Trends, but also in search and advertising.
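To see the difference in PageRank terms, here is a small sketch: the classic random surfer spreads a page's rank uniformly over its outlinks, while an actual surfer version weights each link by how often people actually follow it (the graph and click fractions below are made up):

# Sketch of the difference in PageRank terms. The classic random
# surfer spreads a page's rank uniformly over its outlinks; an
# "actual surfer" version weights each link by observed click-through.
# The graph and click fractions are made up.

def pagerank(pages, weights, damping=0.85, iters=50):
    # weights[p][q] = probability of following the link p -> q
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p in pages:
            for q, w in weights[p].items():
                new[q] += damping * rank[p] * w
        rank = new
    return rank

pages = ["adobe", "blog", "social"]
uniform = {  # random surfer: uniform over outlinks
    "adobe": {"blog": 0.5, "social": 0.5},
    "blog": {"adobe": 0.5, "social": 0.5},
    "social": {"adobe": 0.5, "blog": 0.5},
}
observed = {  # actual surfer: people rarely click through to adobe
    "adobe": {"blog": 0.5, "social": 0.5},
    "blog": {"adobe": 0.1, "social": 0.9},
    "social": {"adobe": 0.1, "blog": 0.9},
}
print(pagerank(pages, uniform))
print(pagerank(pages, observed))  # adobe's rank drops sharply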

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a paper from WSDM 2008 that attempts to supplement the PageRank random surfer model with an actual surfer model.

Black, white, and gray spam

Scott Yih, Robert McCann, and Alek Kolcz had a paper at CEAS 2007 on "Improving Spam Filtering by Detecting Gray Mail" (PDF).

The paper focuses on training a classifier to detect unwanted e-mail, which they called gray mail, but what excites me more is the broader idea of shades of gray in what we think of as spam. The ideas behind the paper seem to suggest that spam should be more of a continuum. Documents vary from annoying to interesting, with much falling in the middle.

Part of my motivation here comes from feeling bothered for some time at what seems to me to be a weak distinction between spammy pages and useless pages.

If a web page exists but no one ever reads it, does it matter?

If a blog exists but has no subscribers, does it matter that it is not actually spam?

In general, whether some document is uninteresting because it is manipulative or uninteresting for another reason, isn't it still uninteresting?

If we do treat spam and uninteresting as similar, it seems like it could make the spam problem somewhat easier. A false positive on spam is much less costly if the penalized document was uninteresting anyway.

Please see also my August 2006 post, "Web spam, AIRWeb, and SIGIR".

Please see also Googler Matt Cutts's post, "Using data to fight webspam".