Friday, February 8, 2013

How Catastrophic Failures in the Super Bowl, Online Video, and Nuclear Power Are Related

Failure is fascinating.

I think there's something ingrained in human nature (and particularly in engineers and designers) that draws us to observe failure and ask, "How can I prevent this from happening to me?"  It's why we slow down to look at car wrecks, and why we're glued to TV news coverage of disasters.  People want to feel in control, and like to rationalize why the disaster they're observing won't happen to them.  Even watching celebrity train wrecks from Gary Busey to Honey Boo Boo provides an affirmation that "At least that's not happening to me."

In particular, I find myself fascinated by engineering failures.  As a kid, watching footage of the collapse of the Tacoma Narrows Bridge with my Dad (also an engineer) instilled in me a sense that all of our technical accomplishments are fleeting and fragile.  The question reverberating in my mind ever since has been "How do we stop these disasters from happening?"

  
The strange thing is, that drive to prevent these kinds of failures is often itself the root cause of the catastrophe. 

My Catastrophe Story
As an adult now, I've seen the occasional engineering catastrophe up close, though usually resulting in failed video feeds, rather than collapsed bridges.  In "war room" or "control room" settings, I've felt the intense pressure of a major live video event going wrong, and needing to help fix it quickly while the pressure mounts and the audience disappears.  

During the very first internet broadcast of NBC's Sunday Night Football experience in 2009, our war room operation was in a panic.  Our real-time analytics were showing a steep drop-off in viewership, and the origin servers streaming out the live video for the event were under incredible load.  It looked like a malicious third party was executing a Denial of Service (DoS) attack against us.  As it turned out, we were our own attackers.

To handle the load of a nation streaming such a big sporting event, we had configured a Content Delivery Network (CDN) to carry the traffic.  Viewers would stream the event from the CDN's servers, and the CDN's servers would pull the stream from our origin servers.  To stay up to date, the CDN's servers requested content from the origins whenever they needed it - and when a request failed, they retried at a higher frequency than usual.  The configuration files for the online player were hosted the same way, but as the player and CDN were configured, every single config file request fell through to the origin servers.  That load caused the origins to stop responding to the CDN's video requests, which in turn caused the CDN to request video chunks far more frequently than normal.  The result was a cascade of ever-increasing load - essentially, a denial of service attack on our origin by our own CDN.

So, the system we'd designed to reduce load on our origins caused lots of load on our origins.
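
To make that feedback loop concrete, here's a toy simulation of the dynamic in Python.  Every number in it (origin capacity, miss rate, retry multiplier, the volume of config requests leaking through) is invented for illustration - this is not our actual SNF configuration - but it shows how a retry policy meant to keep caches fresh can amplify an overload instead of relieving it.

# Toy model of a CDN-to-origin retry storm.  All numbers are made up.
ORIGIN_CAPACITY = 1000   # requests/sec the origin can actually serve
BASE_MISS_RATE = 200     # steady requests/sec that legitimately miss the CDN cache
CONFIG_LEAK = 900        # player config requests/sec falling through to the origin
RETRY_MULTIPLIER = 3     # how much harder the CDN asks after a failed request

load = BASE_MISS_RATE + CONFIG_LEAK
for second in range(8):
    served = min(load, ORIGIN_CAPACITY)
    unanswered = load - served
    print(f"t={second}s  offered={load:.0f} req/s  unanswered={unanswered:.0f}")
    # Every unanswered request comes back as several retries the next second,
    # on top of the steady stream of new cache misses and config requests.
    load = BASE_MISS_RATE + CONFIG_LEAK + unanswered * RETRY_MULTIPLIER

Run it and the offered load roughly doubles every second or two - the origin never gets a chance to recover, because the retry logic that was supposed to protect freshness keeps piling on.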

The good news was that we were able to keep the main feed up for the whole game for most users, with only secondary functionality taking a fatal hit.  By the next game, we had established a robust testing mechanism to determine proper cache offload configuration prior to the games, and things went relatively smoothly for the rest of the season.

The Blackout Bowl
With that experience in mind, I felt strong empathy for the electrical engineers and operators who dealt with the "Blackout Bowl" this past Sunday, as the largest television event of the year threatened to turn into a total disaster when the lights went out in the middle of the Super Bowl.  I've been there, and it was painful to watch (even though it gave our Niners a much-needed reprieve).

The funny thing is, the device that caused the power outage during the Super Bowl was a device designed to... prevent power outages.  Once again, the device designed to prevent the problem actually caused the problem.

This is not a unique phenomenon.  Very often, the fatal flaw in safety systems is the very complexity of those safety systems themselves.

Safety System Danger System
This happens frequently.  Over this past Christmas, Netflix went down for an extended period due to an AWS outage.  What caused it?  An engineer performing maintenance on the East Coast Elastic Load Balancing system - the very system designed to balance load and prevent outages.


At the extreme end of the severity spectrum, we can look to one of the worst industrial disasters in history, the Chernobyl nuclear meltdown.  While the most fundamental flaw was a poor reactor design that accelerated fission during a runaway reaction rather than slowing it, the immediate trigger of the accident was a test of a safety system designed to prevent a runaway reaction.


The Other Blackout During the Blackout Bowl
Now that we have some perspective on how minor the disasters of media streaming are compared to other disciplines, let's revisit this year's Super Bowl.  While most media focused on the obvious power outage as the big catastrophe of the day, another one was brewing during the live streaming of the event. 

Many users reported poor video quality, buffering, and other streaming issues that made their online viewing an unpleasant experience.

My colleague Mio Babic was there with me in that war room in 2009 during the inaugural SNF online broadcast, and as the CEO of iStreamPlanet, he knows a thing or two about high-stakes live online broadcasting.  His blog post this week analyzing the success of the streaming of the Super Bowl makes some great points.  In short - streaming the Super Bowl is an incredibly complex undertaking, but the online broadcast world still doesn't have the same level of commitment to building resilient systems that the TV broadcast world does.  Until that happens, our business of online broadcast will be stifled by executives correctly arguing that TV broadcast infrastructure is superior to online infrastructure:
"So in response to the Monday morning technocrati quarterbacks that highlighted the shortcomings of the biggest and most complex online event 2013 will likely see, and eulogized the technology that helped bring it to millions of connected devices around the world, perhaps we should focus more on operational excellence that ensures quality on par with TV and drives more innovation into the connected device experience."
Operational Excellence
With all this said, how do we achieve high levels of resiliency and operational excellence?  How do we build and operate systems that anticipate common failure patterns but also handle the unexpected?

There are volumes written on this subject, but I feel there are two key areas that are often overlooked when building and operating highly resilient systems:

1) Simplicity: It's not just for your iPhone
Engineers have a tendency to over-engineer, and that goes double (literally) for engineers building redundancy and failure recovery into systems.  It is now widely accepted in the design world that simplicity provides a superior user experience, and the iPhone is perhaps the most well-known touchstone of that philosophy.



Why does this matter when building a safety or disaster recovery system?  Look at the catastrophes mentioned above.  From our self-inflicted DoS attack during SNF to the Blackout Bowl to Netflix to Chernobyl, almost every root cause traces back to a failure in a safety system designed to prevent a disaster in the first place.  Things simply got too complex.

Antoine de Saint-Exupéry said that "A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."

The fewer components a system has, the fewer points of failure it has.  The fewer components it has, the better the chance that everyone on an operational team can understand the whole system, rather than just their corner of it.
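
A rough back-of-the-envelope calculation (my numbers, purely illustrative) shows how quickly component count bites: if a system only works when every one of its parts works, availability multiplies away as you add parts.

# If each of n components must work, and each independently works 99.9% of
# the time, overall availability is 0.999 ** n.  Illustrative numbers only.
for n in (5, 20, 50, 100):
    print(f"{n:>3} components -> {0.999 ** n:.1%} expected availability")

Five components gets you about 99.5%; a hundred components, about 90.5% - and that's before any of them interact in surprising ways.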

It is important to weigh the costs and benefits of putting complex backup systems in place.  Just as a software team will carefully consider the risk of including a last-minute feature before a big release or a live event, it is wise to consider whether a "clever," complex safety system will in fact add more risk of failure than it is designed to prevent.

2) Test Under Load
One of the most difficult problems engineers face in this realm is that you can never truly validate a large-scale system by testing it at small scale, and testing at full scale is usually impossible.  You can't fill the Superdome with power-draining equipment and 50,000 people to test the power system.  You can't run a dry-run of an online media application with an audience of 750,000 viewers.  You can't truly test the Space Shuttle while it's on Earth.

So when that big failure hits in production, it's because production is the first time you've ever run the system at true load.

Early in my career, I designed software for a "pilot plant" at a chemical company.  In chemical manufacturing, you cannot simply take a process designed in a lab and transfer it to large-scale production in a plant.  Heat distributes very differently in a beaker than in a 500-gallon manufacturing reactor, and the process won't behave the same way.  So pilot plants are used as an intermediate step to run the process at a medium scale - perhaps in a 40-gallon reactor.  By measuring the difference between the small- and medium-scale behaviors, you can predict the large-scale behavior.
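
The extrapolation itself is simple enough to sketch in a few lines of Python.  The measurements and the power-law model below are invented for illustration - they're not from any real plant - but they capture the idea: measure the same behavior at two scales, fit the trend, and project it out to production scale.

import math

# Hypothetical measurements: time (minutes) to bring a batch to target
# temperature at two scales.  Values invented for illustration.
scales = [(0.5, 3.0),    # 0.5-gallon lab beaker: 3 minutes
          (40.0, 45.0)]  # 40-gallon pilot reactor: 45 minutes

# Fit a power law t = a * V^b through the two points (a line in log-log space).
(v1, t1), (v2, t2) = scales
b = (math.log(t2) - math.log(t1)) / (math.log(v2) - math.log(v1))
a = t1 / (v1 ** b)

# Extrapolate to the 500-gallon production reactor.
v_prod = 500.0
print(f"fitted exponent b = {b:.2f}")
print(f"predicted heat-up time at {v_prod:.0f} gal: {a * v_prod ** b:.0f} min")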

The same principle often applies to electrical systems and the internet.  For instance, the process we put in place to test our CDN for SNF was to perform cache-offload tests under a medium amount of load.  Before each game, we had a handful of users (10-20) stream test content and use the app for a few minutes.  This let us generate a cache-offload report that would red-flag any files that weren't overwhelmingly being served from the CDN's servers (and were instead falling through to the origin).
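
Here is a sketch of what such a report can boil down to.  The log format, URLs, and 95% threshold are hypothetical (the real tooling was CDN-specific), but the idea is simply to compute, per file, what fraction of requests the CDN absorbed versus how many fell through to the origin.

from collections import defaultdict

# Hypothetical access-log records: (url, served_from), where served_from is
# "edge" for a CDN cache hit or "origin" for a request that fell through.
records = [
    ("/player/config.xml", "origin"),
    ("/player/config.xml", "origin"),
    ("/video/chunk_001.ts", "edge"),
    ("/video/chunk_001.ts", "edge"),
    ("/video/chunk_001.ts", "origin"),
]

MIN_OFFLOAD = 0.95  # flag any file served from cache less than 95% of the time

counts = defaultdict(lambda: {"edge": 0, "origin": 0})
for url, source in records:
    counts[url][source] += 1

for url, c in sorted(counts.items()):
    total = c["edge"] + c["origin"]
    offload = c["edge"] / total
    flag = "  <-- RED FLAG" if offload < MIN_OFFLOAD else ""
    print(f"{url}: {offload:.0%} offload ({total} requests){flag}")

Anything red-flagged in the medium-scale test (like the config file above) was a candidate to melt the origin at full scale, and got fixed before game day.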

Constant Improvement - "It wasn't magic"
No large failure is entirely preventable or predictable.  What's key is an organizational culture of constant improvement and simplification that focuses on operational quality.  My colleague David Seruyange recently posted a quote from Glenn Reid, who worked with Steve Jobs on iMovie and iPhoto, that captures the attitude needed:
"... it wasn't magic, it was hard work, thoughtful design, and constant iteration." 
