Sunday, January 16, 2011

Martin Odersky's Scala Levels

I just saw Martin's post on levels of expertise in Scala, and Tony Morris's response. The concept of levels is something that many in the community have called for repeatedly, in order to help ease people - especially team members who may be a little put off by some of Scala's more advanced features - into the language more gently. I understand the motivation, but it has always struck me as somewhat odd. The reason is that I think one needs to have very compelling reason for such a significant change to be justified, especially a change from a mature, mainstream technology such as Java into one as new an immature as Scala. To being, I'll recount my experience with adding Python as a stable in my programming toolbox.

Learning Python

I dabbled it Python a lot before really using it extensively. Much of my initial usage basically amounted to using Jython as a repl for interactively poking at Java libraries, especially ones that weren't particularly well documented. I thought Python was neat, but at the time most of my programming experience was in C++ and Java. I valued static typing. I valued having a compiler. When I was in college, I had a professor who said something along the line so of "The compiler is your friend. Write your code so that the compiler is more likely to catch your mistakes. Don't try to trick it or bypass it." This was in the context of C++ programming, and given all the ways that one can shoot himself in the foot using C++, it always seemed like sage advice. So giving up have a compiler to catch my mistakes, and an explicit structure clearly present in my code, seemed like an awful lot to ask. Sure, Python is more concise, but in my opinion conciseness is often more a matter of how closely the libraries you're using match the problem that you are trying to solve. About a year ago I had some experimental Python code that I had thrown together that did some regular expression heavy text processing then some XML processing (I was processing output of an application that produced invalid XML, so the first phase was the fix it so the XML parser wouldn't crash and the second was to actually process it). I can't remember why, but I decided to convert it to Java. I think it was because I wanted to integrate it with something that integrated well with Java, but not as well with Python. Anyway, the point is that the Python code and the Java code weren't that different in length. Sure, Java has a lot more boiler plate, but once your code starts being dominated by actual logic instead of getters and setters, it's the libraries that make the difference.

Anyway, I didn't really get into Python until I discovered and thoroughly learned metaclasses and descriptors. In my mind, they took Python from being a toy with a clean syntax to a language that opened up an entire new dimension of abstraction. A dimension of abstraction that had been missing all me life. A dimension that I had glimpsed in C++ template metaprogramming but had never quite managed to enter. All of a sudden I had this tool that enabled me to drastically and repeatedly reduce code sizes all while making the business logic cleaner and more explicit. I had a means to encode my ideas such that they enabled my coworkers to use them without fulling understanding them (an admittedly dangerous proposition). It was beautiful! They remain an important part of my toolbox to this day.

Incidentally, my first uses metaclasses and descriptors was to concisely implement immutable, lazy data structures. This was long before I knew anything about functional programming.

Learning Scala

I also dabbled in Scala a lot before really becoming enamored with it. I was caught by the conciseness over Java, especially the absence of getters and setters, the presence of operator overloading, and multiple inheritance using traits. But I had operator overloading and multiple inheritance in both C++ and Python. I could usually abstract away explicit getters and setters in Python, and by the time I started picking up Scala I had already made a significant shift towards immutability and laziness, so setters weren't as necessary. Scala was neat, and the compatibility with Java was a real benefit, but at first it didn't seem compelling. In fact, when I first started experimenting with Scala it seemed too buggy to really be worth it. Something worth watching, but not yet worth using. Also, by the time I started learning Scala I had already learned Common Lisp and functional programming with full tail-call optimization seemed awkward (and still does).

What eventually sold me on Scala was higher-kinded types, companion objects, and objects-as-modules. You see, but the time I started picking up Scala I had already developed a habit of using a lot of generic programming. This is easy in Python where there's no compiler check a type system to tell you that you are wrong. But a lot of the patterns I commonly used couldn't be translated into Java generics (in fact, in my first foray into Java generics I crashed javac writing code that was compliant with the Java specification). Scala gave me concise programs and type safety! And with the introduction of collections that actually use higher-kinded types in 2.8.0, along with the leaps and bounds made in robustness, I think it finally reached threshold for serious use.

That's not to say these are the only key features. Understanding folds was a real eureka moment for me. I'm not sure how one can do effective functional programming without them, but perhaps there is a way. Implicit definitions are also key, and they are used so extensively in Scala libraries that I think writing Scala without a decent understanding of them is nothing short of voodoo programming. In short, there's a lot, and the features tend combine together in powerful ways.

Summary

I understand the motivation for establishing levels, but I personally think if one avoids many of the features that Martin has flagged as advanced, then the benefits of Scala don't justify the risk of using such a new technology and the need to retrain people. I think without them one is better off finding a better library or framework for their current stack. Also, between the fact that advanced features are omnipresent in the libraries, and teams will always have someone who wants to use the more powerful features, avoiding them works a lot better in theory than it does in practice. You can't not use them, unless you don't use Scala's standard library and avoid almost any library written explicitly for Scala. You can only pretend that you are not using them.

This is probably going to get me flamed, but reward must justify risk. In many circumstances Scala has the features to justify the risk associated with it, but the justification quickly vanishes as the more powerful features are stripped away. So while I think a programmer may be productive at Martin's lower levels, they don't justify Scala over Java.

Sphere: Related Content

Wednesday, January 12, 2011

Types of Project Failure

I saw this question on StackExchange and it made me think about all the projects I've been on that have failed in various ways, and at times been both declared a success and in my opinion been a failure at the same time (this happens to me a lot). Michael Krigsman posted a list of six different types a while back, but they are somewhat IT centric and are more root causes than types (not that identifying root causes isn't a worthwhile exercise - it is). In fact, when if you Google for "types of project failures," the hits in general are IT centric. This may not seem surprising, but I think the topic deserves more general treatment. I'm going to focus on product development and/or deployment type projects, although I suspect much can be generalized to other areas.

Here's my list:

  1. Failure to meet one or more readily measurable objectives (cost, schedule, testable requirements, etc)
  2. Failure to deliver value consummate with the cost of the project
  3. Unnecessary collateral damage done to individuals or business relationships

I think these are important because in my experience they are the types of failure that are traded "under the table." Most good project managers will watch their measurable objectives very closely, and wield them as a weapon as failure approaches. They can do this because objective measures rarely sufficiently reflect "true value" to the sponsor. They simply can't, because value is very hard to quantify, and may take months, years, or even decades to measure. By focusing on the objective measures, the PM can give the sponsor the opportunity to choose how he wants to fail (or "redefine success," depending on how you see it) without excessive blame falling of the project team.

Subjective Objective Failure

The objective measures of failure are what receive the most attention, because, well, they're tangible and people can actually act upon them. My definition of failure here is probably a bit looser than normal. I believe that every project involves an element of risk. If you allocate the budget and schedule for a project such that is has a 50% chance of being met - which may be a perfectly reasonable thing to do - and the project comes in over budget but within a reasonably expected margin relative to the uncertainty, then I think the project is still a success. The same goes for requirements. There's going to be some churn, because no set of requirements is ever complete. There's going to be some unmet requirements. There are always some requirements that don't make sense, are completely extraneous, or are aggressive beyond what is genuinely expected. The customer may harp on some unmet requirement, and the PM may counter with some scope increase that was delivered, but ultimately one can tell if stakeholders feel a product meets its requirements or not. It's a subjective measure, but it's a relatively direct derivation from objective measures.

Value Failure

Failure to deliver sufficient value is usually what the customer really cares about. One common situation I've seen is where requirements critical to the system's ConOps, especially untestable ones or ones that are identified late in the game, are abandoned in the name of meeting more tangible measures. In a more rational world such projects would either be canceled or have success redefined to include sufficient resources to meet the ConOps. But the world is not rational. Sometimes the resources can't be increased because they are not available, but the sponsor or other team members chooses to be optimistic that they will become available soon enough to preserve the system's value proposition, and thus the team should soldier on. This isn't just budget and schedule - I think one of the more common problems I've seen is allocation of critical personnel. Other times it would be politically inconvenient to terminate a project due to loss of face, necessity of pointing out an inconvenient truth about some en vogue technology or process (e.g. one advocated by a politically influential group or person), or idle workers who cost just as much when they're idle as when they're doing something that might have value, or just a general obsession with sunk costs.

This is where the "value consummate with the cost" comes in. Every project has an opportunity cost in addition to a direct cost. Why won't critical personnel be reallocated even though there's sufficient budget? Because they're working on something more important. A naive definition of value failure would be a negative ROI, or an ROI below the cost of capital, but it's often much more complicated.

It's also important to note here that I'm not talking about promised ROI or other promised operational benefits. Often times projects are sold on the premise that they will yield outlandish gains. Sometimes people believe them. Often times they don't. But regardless, and usually perfectly possible to deliver significant value without meeting these promises. This would be a case of accepting objective failure because the value proposition is still there, it's just not as sweet as it was once believed to be.

Collateral Damage

Collateral damage receives a fair amount of press. We've all heard of, and most of us have one time or another worked, a Death March project. In fact, when I first thought about failure due to collateral damage, I thought any project causing significant collateral damage would also fall under at least one of the other two categories. It would be more a consequence of the others than a type unto itself. But then I thought about it, and realized experienced a couple projects where I feel the damage done to people was unacceptable, despite the fact that in terms of traditional measures and value delivered they were successful. There's a couple ways this can happen. One is that there are some intermediate setbacks from which the project ultimately recovers, but one or more people are made into scapegoats. Another way arises from interactions among team members that are allowed to grow toxic but never reach the point where they boil over. In my experience a common cause is an "extreme" performer, either someone who is really good or really bad, with a very difficult personality, particularly when the two are combined on a project with weak management.

Now, you might be wondering what the necessary causes of collateral damage are. It's actually fairly simple. Consider a large project that involves a large subcontract. The project may honestly discover that the value contributed by the subcontract does not justify its cost, or that the deliverables of the subcontract are entirely unnecessary. In this case the subcontract should be terminated, which in turn is likely to lead to a souring of the business relationship between the contractor and the subcontractor, along with potentially significant layoffs at the subcontractor. Often times a business must take action, and there will be people or other businesses who lose out through no fault of their own.

Back to Trading Among Failure Types

A Project Manager, along with projects sponsors and other key stakeholders, may actively choose which type of failure, or balance among them, is most desirable. Sometimes this "failure" can even really be success, so long as significant value is delivered. Some of the possible trades include:

  1. Failing to meet budget and schedule in order to ensure value.
  2. Sacrificing value in order to meet budget and schedule...
  3. ...potentially to avoid the collateral damage that would be inflicted in the case of an overt failure
  4. Inflicting collateral damage through practices such as continuous overtime in order to ensure value or completing on target
  5. Accepting reduced value or increased budget/schedule in order to avoid the collateral damage of the political fallout for not using a favored technology or process

Ultimately some of these trades are inevitable. Personally, I strongly prefer focusing on value. Do what it takes while the value proposition remains solid and the project doesn't resemble a money pit. Kill it when the value proposition disappears or it clearly becomes an infinite resource sink. But of course I know this is rather naive, and sometimes preventing political fallout, which I tend to willfully ignore in my question for truth, justice, and working technology; has an unacceptable if intangible impact. One of my most distinct memories is having a long conversation with a middle manager about a runaway project, and at the end being told something like "Erik, I agree with you. I know you're right. We're wasting time and money. But the money isn't worth the political mess it would create if I cut it off or forcefully reorganized the project." I appreciated the honesty, but it completely floored me, because it meant not only that politics could trump money, but politics could trump my time, which was being wasted. Now I see the wisdom in it, and simply try to avoid such traps when I see them and escape them as quickly as possible when I trip over them.

Sphere: Related Content

Monday, January 03, 2011

StackOverflow and Scraper Sites

I recently noticed that Google searches were turning up a lot of sites that mirror StackOverflow content as opposed to the originals. It appears that I'm not alone. This morning Jeff Atwood blogged about how they're having increasing problems with these sites receiving higher Google rankings. His post, and especially the comments, are filled with righteous indignation about how it's the end of the Internet as we know it. Of course, I can't remember a time when search results weren't polluted faked-out results, so I don't understand why he's so surprised.

But I do think this situation is different. The difference is that in many cases the presentation is far different from my previous exposure to link farms. Historically, my impression is that pages on such sites generally have the following attributes:

  1. The recycled content is poorly formated and often truncated
  2. Unrelated content from multiple sources is lumped together in a single page (usually there is some common keyword)
  3. Advertisements are extremely dominant and often unrelated to the content
  4. The link-to-content ratio is often very high, and the links are to unrelated content

In fact, I think if one were to develop a set of quantitative metrics that could be automatically measured for pages based on the above criteria, I think StackOverflow would perform worse than some of the sites that mirror its content. It's rather ironic, but if you think about it, if you wanted to defeat an algorithm that was developed to find the "best" source for some common content, you'd do everything you could to make your scraper site look more legitimate than the original. Let's compare a question on StackOverflow with it's cousin on eFreedom.com.

Analysis

The primary content is essentially the same with similar formatting. The two major differences are that eFreedom only directly displays the selected answer, as opposed to all of the answers with the selected answer on the top, and none of the comments are displayed. This may help avoid triggering the "unrelated content" rule, because the selected answer is probably the most cohesive with the question, and comments frequently veer off-topic. But I suspect the affect is minimal.

Now consider the advertisements. eFreedom has a few more, but they are more closely tied to the content (using Google AdWords, which probably helps). The advertisements on StackOverflow are for jobs identified via geolocation (I live in Maryland), and the words in them don't correlate particularly well to the primary content, even though they are arguably more relevant to StackOverflow users that run-of-the-mill AdWords ones.

Now let's consider the related links. StackOverflow has links to ~25 related questions in a column along the side. The only content from the questions in the title, and the majority seem to be related based on matching a single tag from the question. eFreedom, on the other hand has links to 10 related questions (they appear to be matching on both tags), puts them inline with the main content, and includes a brief summary. As a human being I think the StackOverflow links are much more noisy and less useful. If I try to think like an algorithm, what I notice is StackOverflow has a higher link-to-content ratio, and the links are to more weakly related content.

The only other major difference is that eFreedom has a "social network bar" on the page. I'm not sure how this would affect ranking. It probably helps with obtaining links.

If you look at the HTML for each page, both use Google Analytics, both use what I'm assuming is a CDN for images, and StackOverflow appears to have a second analysis service. On casual inspection, neither appear to be laden with external content for tracking purposes, although without deeper inspection there's no way to be sure. But I presume a having a large number of tracking links would make a site look more like spam, and having few of them make it look more legit.

Conclusion

I don't think either site looks like spam, but between the two, StackOverflow has more spam-like characteristics. eFreedom's content and links are considerably less noisy than StackOverflow's. Is eFreedom being a leach? Yes, certainly, but it, and I believe some of the other sites replicating StackOverflow's content, don't look like traditional link-spam sites. In fact, for a person who is just looking for content, as opposed to looking to participate in a community, then eFreedom is at least as good, if not slightly better. There may be a moral argument that the original source should be given priority over replications, but from a pure content quality perspective StackOverflow and its copiers are essentially identical, and computer algorithms aren't generally in the business of making moral judgements. Also, there are many forums and mailing list archives out there that have atrocious user interfaces where the casual searcher is likely better off being directed to a content aggregator than to the original source, so I don't think a general rule giving preference to original sources would be productive. Ultimately, I think open community sites like StackOverflow are going to have to compete with better SEO and perhaps better search and browsing UI's for non-contributors, rather than relying up search engines to perform some miracle, because the truth is that from a content consumption perspective the replicated sites are just as good.

Sphere: Related Content

Tuesday, December 28, 2010

Easier object inspection in the Scala REPL

If you're like me, you spend a fair amount of time poking at objects of poorly documented classes in a REPL (Scala or otherwise...). This is great compared to writing whole test programs solely for the purpose of seeing what something does or what data it really contains, but it can be quite time consuming. So I've written a utility for use on the Scala REPL that will print out all of the "attributes" of an object. Here's some sample usage:

scala> import replutils._
import replutils._

scala> case class Test(a: CharSequence, b: Int)
defined class Test

scala> val t = Test("hello", 1)
t: Test = Test(hello,1)

scala> printAttrValues(t)
hashCode: int = -229308731
b: int = 1
a: CharSequence (String) = hello
productArity: int = 2
getClass: Class = class line0$object$$iw$$iw$Test

That looks fairly anti-climatic, but after spending hours typing objName. to see what's there, and poking at methods, it seems like a miracle. Also, one neat feature of it is that if the class of the object returned is different from the class declared on the method, it prints both the declared class and the actual returned class.

Code and further usage instructions is available on BitBucket here.

So what exactly is an attribute?

I haven't quite figured that out yet. Right now it is any method that meets the following criteria:

  1. the method has zero arguments
  2. the method is public
  3. the method is not static
  4. the method's name is not on the exclude list (e.g. wait)
  5. the method's return type is not on the exclude list (e.g. Unit/void)
  6. the method is not a "default argument" method
Please post any feedback in the comments here, or better yet in the issue tracker on Bitbucket.

Sphere: Related Content

Monday, December 27, 2010

Playing with Scala Products

Have you ever wanted something like a case class, that magically provides decent (for some definition of decent) implementations of methods like equals and hashCode, but without all of the restrictions? I know that I have. Frequently. And recently I ran into an answer on StackOverflow that gave me an idea for how to do it. The trick is to play with Products. I'm not sure how good of an idea it is, but I figure the Internet will tell me. Here's the basic idea:

import scala.runtime.ScalaRunTime

trait ProductMixin extends Product {
  def canEqual(other: Any) = toProduct(other) ne null
  protected def equalityClass = getClass
  protected def equalityClassCheck(cls: Class[_]) = equalityClass.isAssignableFrom(cls)
  protected def toProduct(a: Any): Product = a match {
    case a: Product if productArity == a.productArity && equalityClassCheck(a.getClass) => a
    case _ => null
  }
  override def equals(other: Any) = toProduct(other) match {
    case null => false
    case p if p eq this => true
    case p => {
      var i = 0
      var r = true
      while(i  getClass.getName()
  }
  override def hashCode = ScalaRunTime._hashCode(this)
  override def toString = ScalaRunTime._toString(this)
}

Basic Usage

There's a couple ways to use ProductMixin. The first, and probably the simplest, is to extends one of the ProductN classes and provide the appropriate defs:

class Cat(val name: String, val age: Int) extends Product2[String, Int] with ProductMixin {
  def _1 = name
  def _2 = age
}

The second way is to provide them yourself:

 class Dog(val name: String, age: Int) extends ProductMixin {
  def productArity = 2
  def productElement(i: Int) = i match {
    case 0 => name
    case 1 => age
  }
}

Either way, the result is something that has many of the benefits of case classes but allows for more flexibility. Here's some sample usage:

scala> val c1 = new Cat("Felix", 2)
val c1 = new Cat("Felix", 2)
c1: Cat = Cat(Felix,2)

scala> val c2 = new Cat("Alley Cat", 1)
val c2 = new Cat("Alley Cat", 1)
c2: Cat = Cat(Alley Cat,1)

scala> val c3 = new Cat("Alley Cat", 1)
val c3 = new Cat("Alley Cat", 1)
c3: Cat = Cat(Alley Cat,1)

scala> c2 == c3
c2 == c3
res0: Boolean = true

scala> c1 == c2
c1 == c2
res1: Boolean = false

Dealing with Inheritance

Equality in the presence of inheritance is very tricky. But, possibly foolishly, ProductMixin makes it easy! Let's say you have a hierarchy, and you want all the classes under some root class to be able to equal each other. Here's how it would be done by overriding the equalityClass so that it is the root of the hierarchy (using a not-so-good example):

scala> abstract class Animal(val name: String, val age: Int) extends Product2[String, Int] with ProductMixin {
     | override protected def equalityClass = classOf[Animal]
     | def _1 = name
     | def _2 = age
     | }
defined class Animal

scala> class Dog(name: String, age: Int) extends Animal(name, age)
class Dog(name: String, age: Int) extends Animal(name, age)
defined class Dog

scala> class Cat(name: String, age: Int) extends Animal(name, age)
class Cat(name: String, age: Int) extends Animal(name, age)
defined class Cat

scala> val c = new Cat("Felix", 1)
val c = new Cat("Felix", 1)
c: Cat = Cat(Felix,1)

scala> val d = new Dog("Felix", 1)
val d = new Dog("Felix", 1)
d: Dog = Dog(Felix,1)

scala> c == d
c == d
res0: Boolean = true

The reverse can also be accomplished by overriding equalityClassCheck such that it checks that the classes are equal instead of using isAssignableFrom. That would mean two objects could be equal if and only if they are the same class.

Wrapup

I don't know to what extent this is a good idea. I've only tested it a bit in the REPL. It's neat, but there is one problem that comes to mind: performance. Any primitive members that are elements of the product will end up being boxed prior to usage in the hashCode and equality methods, the extra layers of indirection aren't free, and neither is the loop. That being said, case classes use the exact same method for hashing, so in what way they pay the same penalty, and unless code is really hashing and equality heavy it probably wouldn't be noticeable.

Sphere: Related Content