The AWS Outage – How to Avoid Downtime in the Cloud

Posted by on Apr 23, 2011 | 36 Comments

Amazon AWS suffered one of the biggest outages in cloud computing to date, with data housed in a single availability zone knocked out for more than 24 hours in some cases. Many people are pointing to this as proof of a cloud computing failure. Some have suggested that traditional servers-in-a-rack models are safer. These people apparently don’t understand the flexibility of cloud computing.

The way Amazon AWS works, you can replicate your environment across multiple regions and availability zones to prevent the kind of catastrophic downtime we saw this week. If you opt not to deploy across regions and availability zones, you are effectively gambling that Amazon won’t have an AWS outage like this one.

In this particular outage, spanning availability zones within US East was not enough, because AWS had issues across the entire US East region. For more complete redundancy, look at an application architecture that spans both availability zones and regions, so that when US East has a problem, your app can continue to run from California, Ireland, or one of the other regions.

Those who experienced the full extent of the downtime took the same risk as IT shops running servers in a single data center. In both models (cloud and servers-in-a-rack), if you don’t have a distributed infrastructure, you’re safe only until your data center becomes unavailable for some reason. In the case of AWS, the data center is the set of availability zones that went unresponsive.

Anyone with redundancy in place in another region merely needed to fail over to that region and continue on with minimal downtime. With even more fault tolerance in place, you could have an instance pool that spans availability zones, meaning an outage might cause an existing session to fail for some users, but most users would never know anything had gone down.
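To make the failover idea concrete, here is a minimal sketch in Python. It is not how AWS or RightScale does it; the region list, the placeholder hostnames, and the `is_healthy` callback are all assumptions for illustration. The point is only that "fail over to another region" boils down to trying regions in preference order and using the first one that passes a health check.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical endpoints for the same app deployed in three AWS regions.
# The hostnames are placeholders, not real services.
ENDPOINTS: Dict[str, str] = {
    "us-east-1": "https://us-east.example.com",
    "us-west-1": "https://us-west.example.com",
    "eu-west-1": "https://eu-west.example.com",
}


def pick_region(
    preferred: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None


def route(preferred: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the endpoint of the first healthy region, or raise if none is up."""
    region = pick_region(preferred, is_healthy)
    if region is None:
        raise RuntimeError("no healthy region available")
    return ENDPOINTS[region]


if __name__ == "__main__":
    # Simulate this week's situation: US East is down, the other regions are up.
    down = {"us-east-1"}
    print(route(["us-east-1", "us-west-1", "eu-west-1"], lambda r: r not in down))
```

In a real deployment the health check would be an actual probe of each region, and the hard part (keeping data in sync between regions) is not shown here at all.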

This is the beauty of cloud computing. You can easily have as much infrastructure ready to go as you need, or have plenty on standby, if your budget doesn’t allow for all those instances to run continuously. Cloud computing in general, and the AWS infrastructure in particular, makes SLA and uptime a software problem instead of a hardware problem. Developers are in control of as much or as little SLA as they can handle within the AWS construct. That’s a paradigm shift.
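As a rough illustration of treating uptime as something you can reason about in software, the sketch below estimates the combined availability of several deployments. The availability figures are made up for the example, and the formula assumes the deployments fail independently of each other. That assumption is a big one: this week showed that availability zones within a single region can fail together, so it is more plausible across regions than across zones.

```python
from typing import Iterable

HOURS_PER_YEAR = 8760.0


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of a system that is up when at least one deployment is up.

    Assumes independent failures: the system is down only when every
    deployment is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one region
    dual = combined_availability([single, single])
    print(f"one region:  {downtime_hours_per_year(single):.1f} hours down per year")
    print(f"two regions: {downtime_hours_per_year(dual):.2f} hours down per year")
```

The numbers only hold if failover itself works and the failures really are uncorrelated, which is exactly what you have to design and test for.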

Writing the code to span availability zones might seem like a headache, but companies like RightScale make the architecture piece of configuring AWS much easier. With RightScale, you can pool your resources across regions and zones and make them talk to each other without needing to start from scratch each time.

For those companies that got hit by the AWS outage, I feel your pain. I’ve lived through downtime and would like to avoid ever experiencing it again. AWS is currently the best solution for avoiding that downtime, but you have to make use of all the tools Amazon provides for it to work. The AWS outage wasn’t a failure of the cloud; it was the failure of development teams to build in the redundancy provided to them.

  • Scott

    Please stop talking about the AWS outage in the past tense as if it “was” over after only 24 hours. It is still going on.

    • http://chris.pirillo.com/ Chris Pirillo

      In what capacity? Has Amazon at least come to the communication front?

      • Scott

        Yes they have been much more communicative – but many many instances and volumes are still down/stuck – I have one just coming up now for the first time – but it’s struggling.

        • http://www.jakeludington.com Jake Ludington

          That’s definitely not a fun situation to be in, but the point of the article is that you can plan to avoid this by building beyond being just in US East.

  • M.

    Please check your facts. Multiple AZs and products that can be set to automatically use multiple AZs (like EBS/RDS/ElasticBeanstalk) are currently affected: http://status.aws.amazon.com/

    You do have the ability to use Multiple AZ redundancy but that wouldn’t help in such a situation. Multiple Region redundancy is also possible but is not cost-effective due to the cost of data transfer and the further infrastructure required to keep regions with substantial latency between them in sync. There is of course software that handles multiple Region type situations, but it is at the same level of complexity as Multiple Data Center solutions.

    • http://www.jakeludington.com Jake Ludington

      I never said redundancy was cheap. You get better availability by using multiple AZ than a single AZ in a single region. You can have truly high availability by using multiple regions with multi-AZ in those regions.

  • http://www.facebook.com/people/James-Isabel/100001683146522 James Isabel

    Just a suggestion — if you’re going to write snarky rants on a site headlined “IT Professionals”, you might want to have just the slightest clue what you’re talking about.

    AWS’s outage was/is across every single availability zone in US East (their primary region). Every single one. In other words, you’re flat out wrong: those who took plenty of advantage of the “redundancy provided to them” were just as screwed as everyone else.

    • http://www.loudmouthman.com Loudmouthman

      Hi James, GeekPlantonline.com stayed online by being in eu-east-1a. We didn’t go down because we used the redundancy offered. us-east-1a is the default zone when you log in but it is not primary. I am not sure they have a primary, but if you have documentary evidence that would be awesome. We took ‘advantage’ of the zones and we didn’t go down.

    • http://www.facebook.com/people/Dean-Collins/674616722 Dean Collins

      uhm ok James obviously never thought of hosting in a different geographic region (europe) or more logically with a different provider.

    • http://www.jakeludington.com Jake Ludington

      James,

      I have a fair amount of background deploying solutions to AWS. While US East may be Amazon’s primary region, it’s certainly not their only region. By configuring applications to span US East, the California region, Ireland, and Singapore, or some subset of that, you do in fact avoid the issues people are experiencing by only being in US East.

      This is very similar to the model of configuring servers housed in say a data center in Austin, TX and a data center in San Diego, CA for redundancy. The major difference is AWS is quite a bit easier to configure and manage in this regard and far more affordable.

      For smaller application implementations, you can further prescript your application, maintain redundancy outside AWS, and redeploy to another availability zone very quickly. Traditional server environments don’t make that nearly as easy.

      • http://www.facebook.com/people/James-Isabel/100001683146522 James Isabel

        You realize that “availability zones”, as defined by Amazon, are contained within a single region, correct? Your entire post is about how had those stupid, incompetent IT people only deployed to multiple AZs, they would’ve been saved from their own ineptitude.

        You are wrong.

        On top of that, Amazon makes it incredibly difficult to transport data, especially EBS data, between regions. ELB does not work across regions. Their contracts and documentation strongly imply that using multiple AZs in a region makes one’s services fault-tolerant, and that turns out to be very, very far from the truth.

        • http://www.jakeludington.com Jake Ludington

          I do realize that AZ are a subset of a region, which is why I suggest true fault tolerance spans regions. Yes, there’s extra work involved in developing for multi-region, but if you are worried about an outage, that is what you need to do.

          Going multi-region is not much different than the traditional environment where you’d go multi-data center. Multi-region in AWS is cheaper than multi-data center and in my experience, easier to configure.

  • http://c3mdigital.com/ Chris Olbekson

    This post is misinformation. Have you read this: http://justinsb.posterous.com/aws-down-why-the-sky-is-falling

    This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn’t a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don’t know at this point. But the engineers at quora, foursquare and reddit are very competent, and it’s wrong to point the blame in that direction.

    • http://www.jakeludington.com Jake Ludington

      This post is not misinformation. If you want true redundancy in your configuration, you need to deploy to multiple regions, which is the AWS equivalent to multiple data centers.

  • Anonymous

    What a dumb article. You don’t know anything about technology. The Amazon issues define failure of the cloud.

    • http://chris.pirillo.com/ Chris Pirillo

      That’s like saying a random hard drive crash defines “failure of magnetic storage.”

  • http://www.grippo.com Jorge Grippo

    So, to avoid downtime in the cloud you have to span multiple regions? And now you are saying that? It would be a different story if you had opened your mouth before. Until your illumination happened, and EBS went down, multi-AZ was OK, right or wrong?

    • http://www.jakeludington.com Jake Ludington

      Multiple availability zones in the same region is more redundant than being in a single AZ in a single region. But you have to factor in your own tolerance for downtime. If never going down is critical, deploying in multiple regions, across multiple availability zones in those regions is the more fault tolerant approach. This was true before AWS had the current outage.

      I suppose if you were truly paranoid, you’d set up a configuration that relied on AWS but could fail over to Rackspace Cloud if AWS went down, but again, it’s about assessing risk and determining how much potential downtime is acceptable.

  • Markb1439

    But many cloud providers tout redundancy as one of the primary benefits…automatic redundancy, e.g., if one resource goes down things are automatically rerouted and there is little or no downtime. So, to say that you need to use your own redundancy and put your data in multiple baskets…that means that the cloud has no advantage in terms of redundancy. That’s just like setting up a dedicated server and then setting up an identical backup server in another location.

    • http://chris.pirillo.com/ Chris Pirillo

      It does have an advantage insofar as it is typically more affordable to take on inordinate amounts of (unplanned) traffic, and it also has the clear advantage of being able to scale more quickly.

  • Bill

    This is not a failure of “cloud computing”. It might be a failure of Amazon’s customers to select and deploy services in a way to get the redundancy needed. It might be a failure of Amazon to deliver the services according to their own specifications. It certainly is a failure of the technology not matching the marketing hype used to promote it and the unrealistic expectations thus created.