Friday, February 22, 2013

The Secret Menu: In-N-Out Design

Panera Bread gained some attention this week through the revelation of their "secret menu" of items not advertised on the menus in their stores.  West Coasters in the know are aware that In-N-Out has special ordering options it doesn't advertise, which for a long time were available only via word of mouth.  Even Starbucks has a little baby-sized cup of coffee called the "Short" size.

Why do this?  If you offer a feature, why not tell people about it?  If you're In-N-Out Burger, why have an insanely short and simple menu and then hide the deeper options?  And how do these principles apply to the software design world?

  • Creating insider information makes your customers/users part of an exclusive club.  Remember that cool tracking shot scene from Goodfellas (a movie with many great tracking shots) where Ray Liotta takes Karen through the back entrance of the Copacabana nightclub, introducing her to the servers and staff while being shown to a private table and skipping the line?  There's no way you could create that with a fast food joint or a piece of boring software, right?  Wrong.
  • These "power users" are far more likely to want to share their cool hack with their social circles.  
  • Word of mouth marketing from friends is orders of magnitude more effective than traditional advertising.
  • In the age of Google, anyone who really wants to know can effortlessly find the information you're "hiding", so you're not really denying it to anyone.
  • Simplicity and minimal options are pillars of good UX design (even for restaurants).  Don't Make Me Think.  Put the options behind a curtain.

Friday, February 15, 2013

This Machine Kills Television

As asteroid 2012 DA14 narrowly misses Earth this afternoon, it strikes me that the bulk of the serious coverage is not happening on network TV, but instead on NASA TV's online feed, JPL's Ustream feed, and various other online locations with niche streaming feeds from radar stations across the world.

Of course, all-day coverage of an asteroid would surely not be the most profitable content for the major networks today, so they're ceding this niche to various online players.  It wouldn't make sense for them not to.  The only trouble is, by the time the next almost-killer asteroid swings by to say hello to Earth, almost everything will be niche content, and the traditional broadcast and cable networks will see continually decreasing interest in their model of curated linear content.  With high-speed internet access available anywhere on mobile and connected devices, and content liberated through an exploding number of online services, users will seek out content based on their interests and social connections, wherever the fewest barriers and interruptions exist.

Those of us in the media technology world are, to put it simply, building machines that kill television.  This got me thinking about the artistic value of literally applying that label to the machines we use to revolutionize television.

Woody Guthrie famously performed with a unique guitar.  After seeing the phrase "This Machine Kills Fascists" painted on fighter planes during the Spanish Civil War, Guthrie took a "the pen is mightier than the sword" view and put the same phrase on his guitar.  The guitar became a legend in music, and several artists have riffed on the concept.  Steve Earle (of HBO's Treme) comments on New Orleans and Katrina with "This Machine Floats", and Don't Forget to Be Awesome went with "This Machine Pwns N00bs":

Ok - my turn.  If I'm creating software that kills television, what's the most appropriate "guitar" with which to proclaim that message?

The mobile device that's driving forward "TV Anywhere"?

The MacBook Pro I use to create that software?

Or the connected TV device that's emboldened a new wave of cord-cutters?

Friday, February 8, 2013

How Catastrophic Failures in the Super Bowl, Online Video, and Nuclear Power Are Related

Failure is fascinating.

I think there's something ingrained in human nature (and particularly in engineers and designers) that draws us to observe failure and ask, "How can I prevent this from happening to me?"  It's why we slow down to look at car wrecks, and why we're glued to TV news coverage of disasters.  People want to feel in control, and like to rationalize why that disaster they're observing won't happen to them.  Even watching celebrity train wrecks from Gary Busey to Honey Boo Boo provides an affirmation that "At least that's not happening to me."

In particular, I find myself fascinated by engineering failures.  As a kid, watching the footage of the collapse of the Tacoma Narrows Bridge with my Dad (also an engineer) instilled in me a sense that all of our technical accomplishments are fleeting and fragile.  The question reverberating in my mind for the rest of my life has been "How do we stop these disasters from happening?"

The strange thing is, that drive to prevent these kinds of failures is often itself the root cause of the catastrophe. 

My Catastrophe Story
As an adult now, I've seen the occasional engineering catastrophe up close, though usually resulting in failed video feeds, rather than collapsed bridges.  In "war room" or "control room" settings, I've felt the intense pressure of a major live video event going wrong, and needing to help fix it quickly while the pressure mounts and the audience disappears.  

During the very first internet broadcast of NBC's Sunday Night Football in 2009, our war room operation was in a panic.  Our real-time analytics were showing a steep drop-off in viewership, and the origin servers streaming out the live video for the event were under incredible load.  It looked like a malicious third party was executing a Denial of Service (DoS) attack against us.  As it turned out, we were our own attackers.

To handle the load of a nation streaming such a big sporting event, we had configured a Content Delivery Network (CDN) to carry the traffic.  Viewers would stream the event from the CDN's servers, and the CDN's servers would in turn fetch the stream from our origin servers.  To ensure its servers were always up to date, the CDN would issue requests to the origin whenever it needed content - and when a request failed, it would retry at a greater frequency than usual.  Meanwhile, the player's configuration files were hosted in the same manner, and a misconfiguration of the player and CDN caused every configuration file request to fall through to the origin servers.  That load caused the origin to stop responding to the CDN's video requests, which in turn caused the CDN to re-request video chunks far more frequently than normal.  The result was a cascade of ever-increasing load - effectively, a denial of service attack on our origin by our own CDN.
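The feedback loop is easier to see with numbers.  Here's a minimal sketch (all figures hypothetical, not our actual traffic) of how retry-on-failure amplifies origin load:

```python
# Hypothetical model of the retry-amplification loop: as the origin starts
# failing requests, each CDN edge server retries more often, which pushes
# even more load onto the origin -- a self-inflicted denial of service.

def origin_load(edges, base_rps, failure_rate, retry_multiplier):
    """Requests/sec hitting the origin from all CDN edge servers.

    Failed requests get retried at a higher frequency, so the effective
    per-edge request rate grows as the failure rate rises.
    """
    effective_rps = base_rps * (1 + failure_rate * (retry_multiplier - 1))
    return edges * effective_rps

# Healthy origin: 200 edges, 2 requests/sec each
healthy = origin_load(edges=200, base_rps=2, failure_rate=0.0, retry_multiplier=5)

# Overloaded origin failing 80% of requests: retries multiply the load
stressed = origin_load(edges=200, base_rps=2, failure_rate=0.8, retry_multiplier=5)

print(healthy, stressed)  # roughly 400 vs 1680 requests/sec
```

The point of the sketch: the retry behavior that protects freshness under normal conditions is exactly what multiplies the load once the origin starts failing.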

So, the system we'd designed to reduce load on our origins caused lots of load on our origins.

The good news was that we were able to keep the main feed up for the whole game for most users, with only secondary functionality taking a fatal hit.  By the next game, we had established a robust testing mechanism to determine proper cache offload configuration prior to the games, and things went relatively smoothly for the rest of the season.

The Blackout Bowl
With that experience in mind, I felt strong empathy for the electrical engineers and operators who dealt with the "Blackout Bowl" this past Sunday, as the largest television event of the year threatened to turn into a total disaster as the lights turned off during the middle of the Super Bowl.  I've been there, and it was painful to watch (even though it gave our Niners a much needed reprieve).

The funny thing is, the device that caused the power outage during the Super Bowl was a device designed to...prevent power outages.  Once again, a situation where the device designed to prevent the problem actually caused the problem.

This is not a unique phenomenon.  Very often, the fatal flaw in safety systems is the very complexity of those safety systems themselves.

Safety System Danger System
This happens frequently.  This past Christmas, Netflix went down for an extended period due to an AWS outage.  What caused it?  An engineer performing maintenance on the East Coast Elastic Load Balancing system - the system designed to balance load and prevent outages.

At the extreme end of the severity spectrum, we can look to one of the worst industrial disasters in human history, the Chernobyl nuclear meltdown.  While the most fundamental flaw was a poor reactor design that accelerated fission during a runaway reaction rather than slowing it, the actual trigger of the accident was a test of a safety system designed to prevent a runaway reaction.

The Other Blackout During the Blackout Bowl
Now that we have some perspective on how minor the disasters of media streaming are compared to other disciplines, let's revisit this year's Super Bowl.  While most media focused on the obvious power outage as the big catastrophe of the day, another one was brewing during the live streaming of the event. 

Many users reported poor video quality, buffering and other streaming issues that made their online viewing an unpleasant experience.  

My colleague Mio Babic was there with me in that war room in 2009 during the inaugural SNF online broadcast, and as the CEO of iStreamPlanet, he knows a thing or two about high-stakes live online broadcasting.  His blog post this week analyzing the success of the streaming of the Super Bowl makes some great points.  In short - streaming the Super Bowl is an incredibly complex undertaking, but the online broadcast world still doesn't have the same level of commitment to building resilient systems that the TV broadcast world does.  Until that happens, our business of online broadcast will be stifled by executives correctly arguing that TV broadcast infrastructure is superior to online infrastructure:
"So in response to the Monday morning technocrati quarterbacks that highlighted the shortcomings of the biggest and most complex online event 2013 will likely see, and eulogized the technology that helped bring it to millions of connected devices around the world, perhaps we should focus more on operational excellence that ensures quality on par with TV and drives more innovation into the connected device experience."
Operational Excellence
With all this said, how do we achieve high levels of resiliency and operational excellence?  How do you build and operate systems that anticipate common failure patterns but also handle the unexpected?

There are volumes of books written about this subject, but I feel there are 2 key areas that are often overlooked when building and operating highly resilient systems:

1) Simplicity: It's not just for your iPhone
Engineers have a tendency to over-engineer, and that goes double (literally) for engineers building redundancy and failure recovery into systems.  It is now widely accepted in the design world that simplicity provides a superior user experience, and the iPhone is perhaps the most well-known touchstone of that philosophy.

Why does this matter when building a safety or disaster recovery system?  Let's look at all the catastrophes mentioned above.  From our self-inflicted DOS attack during SNF to the Blackout Bowl to Netflix to Chernobyl, almost every root cause is tied to a failure in a safety system designed to prevent a disaster in the first place.  Things just got too complex.

Antoine de Saint-Exupéry said that "A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."

The fewer components a system has, the fewer points of failure it has.  The fewer components it has, the better the chance that everyone on an operational team can understand the whole system, rather than just their corner of it.

It is important to weigh the costs vs. benefits of putting complex backup systems in place.  Just as a software team will carefully consider the risk of including a last minute feature before a big release or a live event, it is wise to consider whether a "clever" complex safety system will in fact add more risk of failure than it is designed to prevent.

2) Test Under Load
One of the most difficult problems engineers face in this realm is that you can never truly test a large-scale system at a small scale, and testing at true scale is usually impossible.  For instance, you can't fill the Superdome with power-draining equipment and 50,000 people to test the power system.  You can't run a dry-run test of an online media application with an audience of 750,000 viewers.  You can't truly test the Space Shuttle while on Earth.

So, when you encounter that big failure in production, it's because it's the first time you've ever run the process at true load.

Early in my career, I designed software for a "pilot plant" at a chemical company.  In the world of chemical manufacturing, you cannot just take a process designed in a lab and transfer it to large-scale manufacturing in a chemical plant.  Heat distributes far differently in a beaker than in a 500-gallon manufacturing reactor, so the same chemistry won't behave the same way.  So, pilot plants are used as an intermediate step to run the process at a medium scale - perhaps in a 40-gallon reactor.  By measuring the difference between the small- and medium-scale behaviors, you can predict the large-scale behavior.
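The intuition behind that scaling problem is geometric: a vessel's cooling surface grows with the square of its linear size while its volume grows with the cube, so larger reactors shed heat proportionally more slowly.  A rough illustration (idealizing vessels as spheres, which real jacketed reactors are not):

```python
import math

GALLONS_TO_M3 = 0.003785  # one US gallon in cubic meters

def surface_to_volume_ratio(gallons):
    """Surface-area-to-volume ratio of a sphere holding the given volume.

    Idealized: real reactors are jacketed cylinders, but the scaling
    trend (less cooling surface per unit volume as size grows) is the same.
    """
    volume_m3 = gallons * GALLONS_TO_M3
    radius = (3 * volume_m3 / (4 * math.pi)) ** (1 / 3)
    area = 4 * math.pi * radius ** 2
    return area / volume_m3  # m^2 of cooling surface per m^3 of chemical

lab = surface_to_volume_ratio(0.25)    # roughly a 1-liter beaker
pilot = surface_to_volume_ratio(40)    # pilot-plant reactor
plant = surface_to_volume_ratio(500)   # manufacturing reactor

# Each scale-up step leaves proportionally less surface to carry heat away,
# which is why behavior measured at one scale can't be assumed at the next.
print(round(lab / plant, 1))  # ~12.6x more cooling surface per unit volume in the lab
```

The pilot plant sits between the two extremes so the scale-dependent effects can be measured rather than guessed.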

The same principle often applies to electrical systems and the internet.  For instance, the process we put in place to test our CDN for SNF was a cache-offload test under a medium amount of load.  Before each game, we had a handful of users (10-20) stream test content and use the app for a few minutes.  This let us generate a cache-offload report that would red-flag any files that weren't overwhelmingly being served from the CDN's servers (and instead fell through to make requests from the origin).
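A report like that boils down to computing an offload ratio per file and flagging the outliers.  A hypothetical sketch (not our actual tooling; file names and thresholds are made up):

```python
# Hypothetical per-file summary from a medium-load test run:
# (filename, hits served by CDN edges, hits that fell through to origin).

def flag_poor_offload(file_stats, min_offload=0.95):
    """Return (name, offload_ratio) for files below the offload threshold."""
    flagged = []
    for name, edge_hits, origin_hits in file_stats:
        total = edge_hits + origin_hits
        offload = edge_hits / total if total else 0.0
        if offload < min_offload:
            flagged.append((name, round(offload, 3)))
    return flagged

stats = [
    ("chunk_001.ts", 9800, 12),      # healthy: almost all hits served from edge
    ("player_config.xml", 40, 960),  # red flag: hammering the origin
]
print(flag_poor_offload(stats))  # [('player_config.xml', 0.04)]
```

Running this against a few minutes of medium-scale traffic is enough to surface a misconfigured file before hundreds of thousands of viewers amplify it into an outage.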

Constant Improvement - "It wasn't magic"
No large failure is entirely preventable or predictable.  What's key is to have an organizational culture of constant improvement and simplification that focuses on operational quality.  My colleague David Seruyange recently posted a quote from Glenn Reid, who worked with Steve Jobs on iMovie and iPhoto, that captures the attitude needed:
"... it wasn't magic, it was hard work, thoughtful design, and constant iteration." 

Friday, February 1, 2013

Using Basecamp to Manage Large Software Projects: Not for Me

I'm a big fan of 37Signals.  I often quote chapters from Getting Real when describing my software design philosophy, and I've even got a copy of ReWork on my desk.

So, it's strange that as a project leader, I've never used 37Signals' flagship product, Basecamp, to organize my projects.  In the past, I've used heavier-weight solutions like Atlassian's Confluence and JIRA, and Microsoft's Team Foundation Server.  Recently, my teams have been using TFS both for task management/project tracking and for source control, since we need to work in Visual Studio for Xbox-related work.

I'm starting up an Android project this week, and as I was setting up our source control on Github, it occurred to me that this was a great opportunity to experiment with new project management software.  Naturally, Basecamp was the first thing that came to mind.  The prospect of a radically simplified approach to tracking a backlog and burning down tasks in our Agile/Scrum process got me excited.  (Perhaps I need to go skydiving this weekend to recalibrate what gets me excited...)

Basecamp: Too Simple?
I fired up Basecamp, and eagerly jumped in to create my Product Backlog: a collection of stories and sub-tasks underneath each story.

Basecamp works based on To-do lists, so I added a Product Backlog To-do list:

I then started editing my first story "Custom Video Player", looking to add tasks.  Um...I can't.  Ok - let's try to just add each task as a note:

Wait a minute - in Agile/Scrum, tasks need distinct delivery dates and effort estimates.  This approach isn't going to work.

Am I doing something wrong?  Basecamp's simplicity is kicking my ass so far.  Let's try another approach - maybe I just need to think simpler myself:

Basecamp forces me to Simplify
Ok - I think I need to create one To-do list for each story, and then add items under each To-do list for each task:

Alright - that's looking better!  I can even add milestones on the calendar and apply them to each task/to-do item.  Great!

Now - to apply effort estimates and see what a burn-down chart would look like...


Time for Plugins and Hacks
To my surprise, Basecamp didn't include any affordance for burndown tracking within a Sprint.  

After some searching on the internet, I discovered I'm not the only one missing my full-featured PM tools.  

In fact, a clever group of folks were similarly frustrated by maintaining a parallel Excel Spreadsheet to run burndown graphs for their Basecamp-tracked work a couple of years ago. They productized a solution that creates burndown charts for Basecamp projects called "BurndownGraph".

Perfect - that's exactly what I need in my situation!  However, to make it work, I'll need to:

  1. Give BurndownGraph, a third party, my Basecamp login and password.
  2. Manually specify duration/estimates in the names of each To-do in a rigid format: "Custom Video Player 8.5h".
  3. Keep all account information in Basecamp and BurndownGraph synced.
  4. Shell out more money for the BurndownGraph product.
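That second requirement illustrates the brittleness: estimates encoded in free-text names have to be parsed back out, roughly like this (a hypothetical sketch, not BurndownGraph's actual code):

```python
import re

# The rigid convention means every estimate lives inside the to-do's
# free-text name, e.g. "Custom Video Player 8.5h", and any tool reading
# it has to recover the structure by parsing the title.
ESTIMATE = re.compile(r"^(?P<task>.+?)\s+(?P<hours>\d+(?:\.\d+)?)h$")

def parse_todo(name):
    """Split a to-do name into (task, hours); None if the name is malformed."""
    match = ESTIMATE.match(name.strip())
    if not match:
        return None  # one typo and the task silently drops off the burndown
    return match.group("task"), float(match.group("hours"))

print(parse_todo("Custom Video Player 8.5h"))   # ('Custom Video Player', 8.5)
print(parse_todo("Custom Video Player 8.5hr"))  # None - brittle by design
```

Structured data smuggled through free-text fields fails silently, which is exactly the kind of fragility a real PM tool's first-class estimate fields exist to avoid.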

Does this still make sense?
Suddenly, it dawned on me: "THIS ISN'T SIMPLE".  Going to these lengths of hackery is something I've trained myself, over many years in software, to recognize not as cool and clever, but as brittle and non-scalable.

The irony of the situation is that 37Signals are the people I most revere as proponents of achieving beautiful design and great user experiences through simplicity.

Ultimately, I'm still a Basecamp fan, and I do respect 37Signals for keeping Basecamp simple and easy for the use cases for which it works well.  However, I'm now convinced that it's not the right tool for the job of managing larger agile software projects.