
Monday, January 03, 2011

StackOverflow and Scraper Sites

I recently noticed that Google searches were turning up a lot of sites that mirror StackOverflow content rather than the originals. It appears that I'm not alone. This morning Jeff Atwood blogged about how they're having increasing problems with these sites receiving higher Google rankings. His post, and especially the comments, are filled with righteous indignation about how it's the end of the Internet as we know it. Of course, I can't remember a time when search results weren't polluted with faked-out results, so I don't understand why he's so surprised.

But I do think this situation is different. The difference is that in many cases the presentation is far different from my previous exposure to link farms. Historically, my impression is that pages on such sites generally have the following attributes:

  1. The recycled content is poorly formatted and often truncated
  2. Unrelated content from multiple sources is lumped together in a single page (usually there is some common keyword)
  3. Advertisements are extremely dominant and often unrelated to the content
  4. The link-to-content ratio is often very high, and the links are to unrelated content

In fact, if one were to develop a set of quantitative metrics that could be automatically measured for pages based on the above criteria, I think StackOverflow would perform worse than some of the sites that mirror its content. It's rather ironic, but if you think about it, if you wanted to defeat an algorithm that was developed to find the "best" source for some common content, you'd do everything you could to make your scraper site look more legitimate than the original. Let's compare a question on StackOverflow with its cousin on eFreedom.com.

Analysis

The primary content is essentially the same, with similar formatting. The two major differences are that eFreedom directly displays only the selected answer, as opposed to all of the answers with the selected answer on top, and that none of the comments are displayed. This may help avoid triggering the "unrelated content" rule, because the selected answer is probably the most cohesive with the question, and comments frequently veer off-topic. But I suspect the effect is minimal.

Now consider the advertisements. eFreedom has a few more, but they are more closely tied to the content (using Google AdWords, which probably helps). The advertisements on StackOverflow are for jobs identified via geolocation (I live in Maryland), and the words in them don't correlate particularly well with the primary content, even though they are arguably more relevant to StackOverflow users than run-of-the-mill AdWords ones.

Now let's consider the related links. StackOverflow has links to ~25 related questions in a column along the side. The only content from those questions is the title, and the majority seem to be related based on matching a single tag from the question. eFreedom, on the other hand, has links to 10 related questions (they appear to be matching on both tags), puts them inline with the main content, and includes a brief summary of each. As a human being, I think the StackOverflow links are much noisier and less useful. If I try to think like an algorithm, what I notice is that StackOverflow has a higher link-to-content ratio, and the links are to more weakly related content.
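As a toy illustration of how such a metric might be computed (entirely my own sketch, not anything a real search engine actually uses), here's a crude link-to-content ratio in Scala, counting anchor tags against the length of the remaining visible text:

```scala
// Toy metric sketch: estimate a page's link-to-content ratio from raw HTML.
// Purely illustrative; real ranking signals are far more sophisticated.
object LinkRatio {
  def linkToContentRatio(html: String): Double = {
    // Count opening anchor tags.
    val links = "(?i)<a[\\s>]".r.findAllIn(html).length
    // Crudely strip scripts and tags to approximate visible text.
    val text = html
      .replaceAll("(?is)<script.*?</script>", "")
      .replaceAll("<[^>]*>", "")
      .trim
    if (text.isEmpty) 0.0 else links.toDouble / text.length
  }
}
```

By this (admittedly naive) measure, a page stuffed with sidebar question links would score noticeably higher than one dominated by prose.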

The only other major difference is that eFreedom has a "social network bar" on the page. I'm not sure how this would affect ranking. It probably helps with obtaining links.

If you look at the HTML for each page, both use Google Analytics, both use what I'm assuming is a CDN for images, and StackOverflow appears to have a second analytics service. On casual inspection, neither appears to be laden with external content for tracking purposes, although without deeper inspection there's no way to be sure. But I presume that having a large number of tracking links makes a site look more like spam, while having few of them makes it look more legitimate.

Conclusion

I don't think either site looks like spam, but between the two, StackOverflow has more spam-like characteristics. eFreedom's content and links are considerably less noisy than StackOverflow's. Is eFreedom being a leech? Yes, certainly, but it, and I believe some of the other sites replicating StackOverflow's content, don't look like traditional link-spam sites. In fact, for a person who is just looking for content, as opposed to looking to participate in a community, eFreedom is at least as good, if not slightly better. There may be a moral argument that the original source should be given priority over replications, but from a pure content quality perspective StackOverflow and its copiers are essentially identical, and computer algorithms aren't generally in the business of making moral judgements. Also, there are many forums and mailing list archives out there with atrocious user interfaces, where the casual searcher is likely better off being directed to a content aggregator than to the original source, so I don't think a general rule giving preference to original sources would be productive. Ultimately, I think open community sites like StackOverflow are going to have to compete with better SEO, and perhaps better search and browsing UIs for non-contributors, rather than relying on search engines to perform some miracle, because the truth is that from a content consumption perspective the replicated sites are just as good.


Sunday, August 03, 2008

The Costs of Generality

I've been pondering the results of Cedric's Code Challenge, and wondering just how much benefit is derived from optimized, purpose-specific solutions as opposed to solutions that rely on more general libraries or frameworks.  It's fairly common to see debates where one person (or group) insists that general constructs from a standard library or other such solutions represent an unacceptable overhead, and the other side claims that the overhead is meaningless compared to runtime optimizations performed by HotSpot and the cost of programmer time.  These debates can be rather painful to watch, as both sides generally have good points, yet often seem to be arguing right past one another.  Consequently, I think a little exploration of various potential optimizations, and their respective impacts on performance, would be beneficial.

For purposes here, I'm going to say that a solution based on a general framework would be one that uses a general purpose library to generate permutations of digits, filters out the ones with a leading zero, converts the permutations to numbers, and then collects the desired statistics.  A purpose specific solution would be one such as Crazy Bob's that is tailor-made for generating numbers based on permuted digits.

The General Solution

I'm not aware of a combinatorics library for Scala, but it is simple enough to write a generic permutation-generating function:

  def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
    def p(s: Set[E], r: List[E]): Unit =
      if (r.length == n) f(r) else for(e <- s) p(s - e, e :: r)
    p(s, Nil)
  }

This recursively generates all of the possible permutations.  When it has generated a complete permutation, it passes it to the function specified by the caller.  If s is an ordered set, then the permutations will be generated in a predictable order. This can then be used to generate the permutations of digits for the code challenge, as follows:

  import scala.collection.immutable.TreeSet
  val digits = TreeSet(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
  def main(args: Array[String]): Unit = {
    val start = java.lang.System.nanoTime
    var last, cnt, jumpLow, jumpMagnitude = 0L
    for(d <- 1 to 10) permute(digits, d) { p =>
      val xs = p.reverse // digits are generated in the wrong order so must be reversed
      if (xs.head != 0L) {  // make sure this is a valid set of digits
        val cur = xs.foldLeft(0L)((z, a) => (z * 10L) + a)
        val dif = cur - last
        if (dif > jumpMagnitude) {
          jumpLow = last
          jumpMagnitude = dif
        }
        last = cur
        cnt = cnt + 1L
      }

    }
    val end = java.lang.System.nanoTime
    println("Count: " + cnt)
    println("Jump: " + jumpMagnitude + " (from " + jumpLow + " to " + (jumpLow + jumpMagnitude) + ")")
    println("Time: " + ((end - start) / 1000000L) + " ms")
  }

This solution takes about 13 seconds on my MacBook.
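As a standalone sanity check of the ordering claim (with the permute function duplicated so the snippet runs on its own), permuting a small ordered set shows that, once each result is reversed, the permutations come out in lexicographic order:

```scala
import scala.collection.immutable.TreeSet
import scala.collection.mutable.ListBuffer

object PermuteDemo {
  def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
    def p(s: Set[E], r: List[E]): Unit =
      if (r.length == n) f(r) else for (e <- s) p(s - e, e :: r)
    p(s, Nil)
  }

  def main(args: Array[String]): Unit = {
    val results = ListBuffer.empty[List[Int]]
    // Elements are consed on in reverse, so reverse each permutation.
    permute(TreeSet(1, 2, 3), 2) { p => results += p.reverse }
    println(results.toList)
    // List(List(1, 2), List(1, 3), List(2, 1), List(2, 3), List(3, 1), List(3, 2))
  }
}
```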

Generate Only Valid Permutations

The above permutation function can be tweaked as follows to generate only valid permutations (ones without a leading zero), saving about 10% of the execution time.

  def digitPermute(n: Int)(f: List[Long] => Unit): Unit = {
    def p(s: Set[Long], r: List[Long]): Unit =
      if (r.length == n) f(r) else for(e <- s) p(s - e, e :: r)
    for(first <- (digits - 0L)) p(digits - first, first :: Nil)
  }

The above solution executes in about 12 seconds.

Accumulating Numbers Instead of Lists

Both of the methods above construct lists of numbers which are later assembled into numbers.  This wastes memory and cycles, because only the resulting numbers are required and they can be accumulated much more efficiently.  Doing so, as shown below, reduces execution time to about 7 seconds.

  val digits = TreeSet(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
  def digitPermute(n: Int)(f: Long => Unit): Unit = {
    def p(s: Set[Long], d: Int, r: Long): Unit =
      if (d == n) f(r) else for(e <- s) p(s - e, d + 1, r * 10L + e)
    for(first <- (digits - 0L)) p(digits - first, 1, first)
  }
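A standalone spot check of this version (duplicated so it compiles on its own): for two-digit numbers it should yield 10, 12, 13, ... in ascending order, and 9 × 9 = 81 values in total.

```scala
import scala.collection.immutable.TreeSet
import scala.collection.mutable.ListBuffer

object DigitPermuteCheck {
  val digits = TreeSet(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)

  def digitPermute(n: Int)(f: Long => Unit): Unit = {
    def p(s: Set[Long], d: Int, r: Long): Unit =
      if (d == n) f(r) else for (e <- s) p(s - e, d + 1, r * 10L + e)
    for (first <- (digits - 0L)) p(digits - first, 1, first)
  }

  def main(args: Array[String]): Unit = {
    val xs = ListBuffer.empty[Long]
    digitPermute(2)(xs += _)
    println(xs.take(3).toList) // List(10, 12, 13) -- 11 is skipped
    println(xs.length)         // 81: nine choices for each of the two digits
  }
}
```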

Long Set without Boxing

The above implementations all use TreeSet from Scala's standard library, which imposes a few performance penalties. For one, it is "generic." This means that it requires both type erasure and boxing instead of using primitives.  Second, if you look carefully at the definition of TreeSet, you'll notice that it doesn't require its contents to be Ordered, but rather uses a (potentially implicit) view converting the contained type into an Ordered.  This adds an extra layer of indirection, and therefore an extra cost.

  final class LongSet (val contents: Array[Long]) {
    private def indexOf(v: Long, min: Int, max: Int): Int = {
      if (min > max) -1
      else {
        val mid = (min + max) >>> 1
        val midVal = contents(mid)
        if (midVal < v) indexOf(v, mid + 1, max)
        else if (midVal > v) indexOf(v, min, mid - 1)
        else mid
      }
    }
    def foreach(f: Long => Unit) {
      var i = 0
      val max = contents.length
      while (i < max) {
        f(contents(i))
        i = i + 1
      }
    }
    def -(v: Long): LongSet = {
      val max = contents.length - 1
      if (indexOf(v, 0, max) < 0) this
      else {
        val a = new Array[Long](max)
        var i, j = 0
        while (i <= max) {
          val cur = contents(i)
          if (cur != v) {
            a(j) = contents(i)
            j = j + 1
          }
          i = i + 1
        }
        new LongSet(a)
      }
    }
  }
  val digits = new LongSet(Array(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L))
  def digitPermute(n: Int)(f: Long => Unit): Unit = {
    def p(s: LongSet, d: Int, r: Long): Unit =
      if (d == n) f(r) else for(e <- s) p(s - e, d + 1, r * 10L + e)
    for(first <- (digits - 0L)) p(digits - first, 1, first)
  }

This implementation brings the execution time down to ~1.7 seconds, a substantial savings over TreeSet. The comparison isn't quite fair, as TreeSet uses a red-black balanced tree and the code above uses a sorted array, but the difference is still substantial, and it shows that a more targeted data structure can improve performance significantly.  At this point you might be thinking "Well no sh*t sherlock!  Of course a data structure tuned for a specific type is faster than one that is written to handle any type!"  That's true...to a point.  Not all languages implement generics using type erasure and require boxing of values within parameterized classes.  For example, C++ was designed to ensure that data structures implemented using templates impose little or no overhead over hand-written, type-specific ones.
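As a quick standalone sanity check of LongSet's behavior (the class duplicated here so the snippet compiles on its own): removal preserves sorted order, and removing an absent value returns the same instance without copying.

```scala
object LongSetCheck {
  final class LongSet(val contents: Array[Long]) {
    // Binary search over the sorted backing array.
    private def indexOf(v: Long, min: Int, max: Int): Int =
      if (min > max) -1
      else {
        val mid = (min + max) >>> 1
        val midVal = contents(mid)
        if (midVal < v) indexOf(v, mid + 1, max)
        else if (midVal > v) indexOf(v, min, mid - 1)
        else mid
      }
    def foreach(f: Long => Unit): Unit = {
      var i = 0
      while (i < contents.length) { f(contents(i)); i += 1 }
    }
    def -(v: Long): LongSet = {
      val max = contents.length - 1
      if (indexOf(v, 0, max) < 0) this // absent: no copy needed
      else {
        val a = new Array[Long](max)
        var i, j = 0
        while (i <= max) {
          if (contents(i) != v) { a(j) = contents(i); j += 1 }
          i += 1
        }
        new LongSet(a)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val s = new LongSet(Array(0L, 1L, 2L, 3L))
    val t = s - 2L
    var seen = List.empty[Long]
    t.foreach(v => seen = v :: seen)
    println(seen.reverse) // List(0, 1, 3): 2 removed, order preserved
  }
}
```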

Special Purpose Set for Permutation Generation

Another approach is to use a more special-purpose data structure in the permutation function without reducing its generality.  The linked set used in Crazy Bob's solution can be generalized to generate permutations of any type.  Unfortunately, this structure is mutable, and mutates on every invocation.  This means that while it would be possible to pass it directly to client code, doing so would be extremely dangerous, because the client code may maintain a reference to the rapidly changing data structure.  Consequently, the structure needs to be copied into a list or similar structure before being passed to client code.  The solution built around the code below completes in ~5 seconds, which is slower than using a structure explicitly coded for dealing with longs and generating longs, but over twice as fast as generating permutations using the standard TreeSet class.

  private final class Element[E](val value: E, var next: Element[E], var previous: Element[E]) {
    /** remove an element from the set*/
    def use() {
      if (previous ne null) previous.next = next
      if (next ne null) next.previous = previous
    }
    /** put an element back in the set */
    def yieldValue() {
      if (previous ne null) previous.next = this
      if (next ne null) next.previous = this
    }
  }
  private def buildElements[E](s: Set[E]): Element[E] = {
    val iter = s.iterator // `elements` in older Scala versions
    val first = new Element(iter.next(), null, null)
    var cur = first
    while(iter.hasNext) {
      cur.next = new Element(iter.next(), null, cur)
      cur = cur.next
    }
    first
  }
  def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
    def p(start: Element[E], head: Element[E], r: List[E]): Unit = {
      def take(current: Element[E]): Unit = {
        if (current ne null) {
          val newR = current.value :: r
          if (newR.length == n) {
            f(newR)
            take(current.next)
          } else {
            current.use()
            val newHead = if (current eq head) head.next else head
            p(newHead, newHead, newR)
            current.yieldValue()
            take(current.next)
          }
        }
      }
      take(start)
    }
    val first = buildElements(s)
    p(first, first, Nil)
  }
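A standalone check that the generalized linked-set permute still produces every permutation (code duplicated so it runs on its own; the set's iteration order doesn't matter for this assertion):

```scala
object LinkedPermuteCheck {
  // Doubly-linked element that can be temporarily unlinked and relinked.
  private final class Element[E](val value: E, var next: Element[E], var previous: Element[E]) {
    def use(): Unit = {
      if (previous ne null) previous.next = next
      if (next ne null) next.previous = previous
    }
    def yieldValue(): Unit = {
      if (previous ne null) previous.next = this
      if (next ne null) next.previous = this
    }
  }

  private def buildElements[E](s: Set[E]): Element[E] = {
    val iter = s.iterator
    val first = new Element(iter.next(), null, null)
    var cur = first
    while (iter.hasNext) {
      cur.next = new Element(iter.next(), null, cur)
      cur = cur.next
    }
    first
  }

  def permute[E](s: Set[E], n: Int)(f: List[E] => Unit): Unit = {
    def p(start: Element[E], head: Element[E], r: List[E]): Unit = {
      def take(current: Element[E]): Unit =
        if (current ne null) {
          val newR = current.value :: r
          if (newR.length == n) { f(newR); take(current.next) }
          else {
            current.use()
            val newHead = if (current eq head) head.next else head
            p(newHead, newHead, newR)
            current.yieldValue()
            take(current.next)
          }
        }
      take(start)
    }
    val first = buildElements(s)
    p(first, first, Nil)
  }

  def main(args: Array[String]): Unit = {
    var results = List.empty[List[Int]]
    permute(Set(1, 2, 3), 2)(p => results = p :: results)
    println(results.length)        // 6 ordered pairs of distinct elements
    println(results.toSet.size)    // 6: all of them distinct
  }
}
```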

Conclusion

The various implementations here represent a sampling of the ways that Cedric's Code Challenge can be implemented in Scala, and the effects they have on performance.  A relatively direct port of Crazy Bob's solution to Scala completes in ~0.4 seconds, making it by far the fastest solution and about 30 times faster than the solution using standard data structures with a generic permutation generator.  That's not really surprising, so what can we conclude?  The most obvious conclusion is that avoiding the construction of intermediate objects yields a substantial speedup.  This can be seen in two places.  The first is in the switch from constructing a List to represent the permutation to accumulating the Long directly.  The second is in using a special-purpose mutable data structure to generate the permutations, thereby avoiding repeated allocations of Set objects.  Finally, reducing overhead due to things like boxing and the casts associated with type erasure does make a noticeable difference in performance.  On the flip side, Scala's closure-based constructs, such as nested functions and for loops, added negligible overhead, if any at all. Using more general constructs instead of more specific ones clearly has a substantial performance cost, but it's also worth mentioning that this cost is trivial compared to the benefit received in the transition from a brute-force solution to an optimal algorithm.


Saturday, June 28, 2008

Cedric's Code Challenge

Cedric Beust issued an interesting little challenge this morning:

Here is an interesting coding challenge: write a counter function that counts from 1 to max but only returns numbers whose digits don't repeat.

For example, part of the output would be:

  • 8, 9, 10, 12 (11 is not valid)
  • 98, 102, 103 (99, 100 and 101 are not valid)
  • 5432, 5436, 5437 (5433, 5434 and 5435 are not valid)

Also:

  • Display the biggest jump (in the sequences above, it's 4: 98 -> 102)
  • Display the total count of numbers
  • Give these two values for max=10000

He welcomed brute-force solutions, but really the challenge here is in coming up with something more efficient and elegant. There are basically three general approaches:

  1. Run through all the numbers from 0 to n, test each for no repeating digits, and track the above statistics while you do it. This is the brute force method.

  2. Permute the digits in a fashion that generates numbers sequentially and track the statistics. Alternatively you could generate them in any order, sort them, and then calculate the statistics.

  3. Derive a heuristic that proves a given sequence of numbers will all contain repeated digits and can therefore be skipped.

I think #2 is probably the ideal fashion, but I didn't think of it until I was mostly done coding #3.
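For reference, the brute-force approach (#1) is only a few lines of Scala. This standalone sketch reproduces the statistics reported in the Results section for max = 10,000 (here the repeated-digit test lazily uses strings, which is exactly the inefficiency discussed next):

```scala
object BruteForce {
  // Approach #1: test every number for distinct digits, tracking stats directly.
  def distinctDigits(i: Int): Boolean = {
    val s = i.toString
    s.distinct.length == s.length
  }

  // Returns (count, gapLow, gapSize) for numbers 1..max with distinct digits.
  def stats(max: Int): (Int, Int, Int) = {
    var last, cnt, gapLow, gapSize = 0
    for (i <- 1 to max if distinctDigits(i)) {
      if (last != 0 && i - last > gapSize) { gapSize = i - last; gapLow = last }
      last = i
      cnt += 1
    }
    (cnt, gapLow, gapSize)
  }

  def main(args: Array[String]): Unit = {
    val (cnt, gapLow, gapSize) = stats(10000)
    println("count: " + cnt)                                  // 5274
    println("max gap: " + gapSize + " starting at " + gapLow) // 105, from 1098 to 1203
  }
}
```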

Finding Repeated Digits

The first step in solving this problem, no matter what the approach, is to come up with an algorithm to detect repeated digits. Commenters on Cedric's blog came up with a number of ways to do this, most of which centered around converting the integer into a string and then finding repeated characters. This is a rather frighteningly inefficient approach. There is certainly no need to convert the number into a string in order to know its digits. A much simpler approach is to allocate a ten-element array of booleans initialized to false, and then generate the digits from lowest to highest by repeatedly taking the number mod ten. The first time you encounter a digit, you flip its associated array element to true. The second time, you exit, because you have detected a repeat. The array is essentially serving as a thrifty man's map. Here it is in Scala:

  def containsRepeatedDigit(i: Long): Boolean = {
    val digits = new Array[Boolean](10) // elements default to false
    def f(i: Long): Boolean = {
      if (i == 0L) false // all digits have been processed
      else {
        val d = (i % 10L).asInstanceOf[Int]
        if (digits(d)) true
        else {
          digits(d) = true
          f(i / 10L)
        }
      }
    }
    if (i < 11L) false else f(i)
  }
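A few spot checks against the examples from the challenge statement (the function duplicated so the snippet runs standalone):

```scala
object RepeatCheck {
  def containsRepeatedDigit(i: Long): Boolean = {
    val digits = new Array[Boolean](10) // elements default to false
    def f(i: Long): Boolean =
      if (i == 0L) false // all digits have been processed
      else {
        val d = (i % 10L).toInt
        if (digits(d)) true
        else { digits(d) = true; f(i / 10L) }
      }
    if (i < 11L) false else f(i) // below 11 a repeat is impossible
  }

  def main(args: Array[String]): Unit = {
    println(containsRepeatedDigit(98L))   // false
    println(containsRepeatedDigit(99L))   // true
    println(containsRepeatedDigit(102L))  // false
    println(containsRepeatedDigit(5433L)) // true
  }
}
```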

The Heuristic

Consider the number 2201. It has repeating digits. The question is: what's the next number without repeating digits? It is 2301. You could find it using brute force, but you'd end up scanning an extra 99 numbers. Notice that the repetition is in the upper digits. This means that you cannot get a number with non-repeating digits until the second digit (counting from the left) changes. Now consider the number 2200. In this case changes need to occur in both the lower digits and the upper digits; however, addressing the upper digits allows us to skip a much larger section of the search space. Finally, consider 22200. In this case, you still want the second digit. However, you are searching from the right, so algorithms that detect the first repeat won't work. Here's Scala code to find the appropriate digit. Notice that it looks very similar to the repeated-digit test above.

  def max(array: Array[Int]): Int = {
    def f(idx: Int, m: Int): Int = {
      if (idx == array.length) m
      else if (array(idx) > m) f(idx + 1, array(idx))
      else f(idx + 1, m)
    }
    f(1, array(0))
  }

  def repeatedDigit(i: Long): Int = {
    val prevIdx = new Array[Int](10)
    val recentIdx = new Array[Int](10)
    def f(i: Long, idx: Int) {
      if (i > 0) {
        val d = (i % 10L).asInstanceOf[Int]
        if (recentIdx(d) > 0) prevIdx(d) = recentIdx(d)
        recentIdx(d) = idx
        f(i / 10L, idx + 1)
      }
    }
    f(i, 1)
    Math.max(max(prevIdx), 0)
  } 

Now that we have an algorithm for finding the highest digit that needs to be changed, we need one that will take that information and turn it into the next possible number containing no repeating digits. This simply requires basic arithmetic.

  def nextPossibleNonRepeating(i: Long): Long = 
        nextPossibleNonRepeating(i, repeatedDigit(i))

  def nextPossibleNonRepeating(i: Long, idx: Int): Long = {
    if (idx == 0) i + 1L
    else {
      val p = Math.pow(10.0, (idx - 1).asInstanceOf[Double]).asInstanceOf[Long]
      val r = i % p
      val d = p - r
      i + d
    }
  }

Given this, it is easy to generate a sequence:

  def nextNonRepeating(i: Long): Long = nextNonRepeating(i, repeatedDigit(i))
  def nextNonRepeating(i: Long, idx: Int): Long = {
    val p = nextPossibleNonRepeating(i, idx)
    val d = repeatedDigit(p)
    if (d == 0) p else nextNonRepeating(p, d)
  }
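Putting the pieces together (all the helpers duplicated so this compiles standalone), the heuristic reproduces the 2201 example from above: the next candidate produced by the arithmetic step is 2300, and iterating yields 2301, the next number actually free of repeats:

```scala
object HeuristicCheck {
  def max(array: Array[Int]): Int = {
    def f(idx: Int, m: Int): Int =
      if (idx == array.length) m
      else if (array(idx) > m) f(idx + 1, array(idx))
      else f(idx + 1, m)
    f(1, array(0))
  }

  // Highest digit position (counting from the right, starting at 1) that
  // must change to clear a repeat; 0 if there are no repeated digits.
  def repeatedDigit(i: Long): Int = {
    val prevIdx = new Array[Int](10)
    val recentIdx = new Array[Int](10)
    def f(i: Long, idx: Int): Unit =
      if (i > 0) {
        val d = (i % 10L).toInt
        if (recentIdx(d) > 0) prevIdx(d) = recentIdx(d)
        recentIdx(d) = idx
        f(i / 10L, idx + 1)
      }
    f(i, 1)
    math.max(max(prevIdx), 0)
  }

  def nextPossibleNonRepeating(i: Long): Long =
    nextPossibleNonRepeating(i, repeatedDigit(i))

  def nextPossibleNonRepeating(i: Long, idx: Int): Long =
    if (idx == 0) i + 1L
    else {
      val p = math.pow(10.0, (idx - 1).toDouble).toLong
      val r = i % p
      i + (p - r) // round up so the offending digit position rolls over
    }

  def nextNonRepeating(i: Long): Long = nextNonRepeating(i, repeatedDigit(i))
  def nextNonRepeating(i: Long, idx: Int): Long = {
    val p = nextPossibleNonRepeating(i, idx)
    val d = repeatedDigit(p)
    if (d == 0) p else nextNonRepeating(p, d)
  }

  def main(args: Array[String]): Unit = {
    println(nextPossibleNonRepeating(2201L)) // 2300: first candidate after the repeat
    println(nextNonRepeating(2201L))         // 2301: first number with no repeats
  }
}
```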

Solution

Once this is all done, the solution is pretty straightforward. It takes the general form of the function used to generate the next number with non-repeating digits, only it has to keep track of a bunch of extra information.

  def printNonRepeatingReport(start: Long, stop: Long, last: Long, gapLow: Long,
                              gapHigh: Long, cnt: Long): Unit = {
    if (start > stop) {
      println("max: " + last)
      println("max gap: " + (gapHigh - gapLow) + " between " + 
              gapLow + " and " + gapHigh)
      println("count: " + cnt)
    } else {
      val d = repeatedDigit(start)
      if (d == 0L) {
        //println(start)
        val gap = start - last
        val (gl, gh) = if (gap > (gapHigh - gapLow)) (last, start) 
                       else (gapLow, gapHigh)
        printNonRepeatingReport(start + 1L, stop, start, gl, gh, cnt + 1L)
      } else {
        printNonRepeatingReport(nextPossibleNonRepeating(start, d), stop, last, 
                                gapLow, gapHigh, cnt)
      }
    }
  }

Results

I'm not going to list all the numbers here, just the statistics for 1-10,000:

  • max: 9,876
  • max gap: 105 between 1,098 and 1,203
  • count: 5,274

Of course I haven't checked it against a brute-force solution or other posted solutions, so I owe a beer/coffee/tea, depending on your persuasion, to anyone who can point out a bug and provide the solution.

Just for kicks, here are the statistics up to 10,000,000,000:

  • max: 9,876,543,210
  • max gap: 104,691,357 between 1,098,765,432 and 1,203,456,789
  • count: 8,877,690

Which took 1 minute 23 seconds on my MacBook. Try that with a brute-force approach.
