On June 29, 2012, at about 11:20 p.m. Eastern, several corners of the Internet went dark, including Reddit.com, which according to Alexa's engine ranks 61st in Website traffic in the United States.
At the same time, power went out in several states along the Atlantic Seaboard, including northern Virginia, where Amazon.com has a large datacenter.
No, it's no coincidence. The sites that the datacenter served up went down, and they went down hard. My colleague, Robin Miller, did a blog post detailing how a few people may have lost their jobs, and that if you kept yours, the word of the week is "failover." 21st Century IT's own Alison Diana posted that using a single Elastic Compute Cloud (EC2) instance isn't enough -- companies need to use the availability zone features to manage failure.
That's true -- at least to a point. People who ask Amazon to manage their servers are told, in the fine print, that if a datacenter goes out, their software will go down unless they use the tools Amazon offers for high availability.
And, just like RAID-5, that redundancy costs more than, say, running with a single disk drive.
What people don't talk about is how much.
Consider Netflix, which has an explicit, public strategy to use those very same availability zone features to prevent downtime.
Except that, according to Forbes, Netflix fell down just as hard as the rest; it just managed to recover in 90 minutes. Most other sites were back within two hours.
How much did that extra half hour cost?
To get some insight into the time and energy it takes to develop the infrastructure Netflix has, you might look at their help wanted pages for engineering. Specifically, look at all the job titles with "cloud" or "reliability" in them.
Remember when we thought cloud computing was going to make things easier? This company has 10 open positions to grow the team of programmers working to coordinate its cloud efforts.
That does not mean programmers working on features for the company, designing streaming video, archiving content, or processing payments. No, that is simply building the platform to make cloud computing possible.
And those are just the open slots.
Again, I ask -- remember when we thought cloud computing would make things easier?
It was supposed to be a grid that you plug into to make computing power available like a light bulb, or a television.
It's great that Amazon has some tools to enable remote servers. Even better that we can get pre-configured compute environments. For some Websites that are free to use and advertiser-funded, like Google, Facebook, Twitter, and Reddit, that might just be good enough.
For the rest of us, cloud computing is no free lunch. We'll continue to have work to do.
The grid isn't here yet, at least not as it was originally sold to us. It takes too much work, the standards and compatibility don't exist (try switching from IBM to HP to Amazon) and real high-availability requires redundancy, coordination of servers, and a small army of technical staff.
Cloud computing providers will need to support hundreds of thousands of users and services to ensure the highest quality. Robust and dynamic infrastructures are critical: transparency, scalability, monitoring/management, and security.
Matt Heusser 7/5/2012 2:34:26 PM User Rank Blogger
Re: Balance in reporting
"One would expect the production site fails over DR site when there is outage in production site."
The point I was trying to get at with the cloud was, if something fails, that's not supposed to be my problem. We are supposed to have abstracted away the whole concept ... and we're not there yet. Perhaps my standards are too high, but I seem to recall that was a great deal of the rhetoric that got this whole cloud thing started, no? :-)
Matt Heusser 7/5/2012 2:33:11 PM User Rank Blogger
Re: Balance in reporting
Good point, Rich. The EC2 SLA is 99.95% ( http://aws.amazon.com/ec2-sla/ ), though some of the coverage is reporting much longer outages; I published the ones that were best confirmed. There was a similar outage on June 18th ( http://www.zdnet.com/blog/btl/amazon-web-services-suffers-partial-outage/79981 ) and, if memory serves, on June 6th.
Last night, our power grid went out in my tiny town in West Michigan, and it does go down for a few hours a year. Perhaps the cloud is a grid, and my problem is one of expectations. :-)
Rich Bruklis 7/4/2012 11:45:19 AM User Rank Blogger
Re: Balance in reporting
Amazon offers 99.95% uptime which is about 4 hours and 22 minutes of downtime per year. It seems to me that they go down twice a year for about 2 hours each so I think they know what their risks and recovery are.
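The downtime figure above is simple arithmetic. Here is a minimal sketch, assuming the 99.95% applies across a full 365-day year (the actual SLA may measure it differently) and taking the "twice a year for about 2 hours each" estimate at face value. The function and variable names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Minutes per year a service can be down and still meet the uptime figure."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


budget = downtime_budget_minutes(99.95)
hours, minutes = divmod(budget, 60)
print(f"Budget at 99.95%: {int(hours)} h {minutes:.1f} min")

# Rough estimate from the comment: two outages of about two hours each
observed = 2 * 120
print(f"Observed: {observed} min, within budget: {observed <= budget}")
```

By that rough count, four hours of outages a year would still fit inside the allowance, though only by about 20 minutes.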
I'd bet their data center is about 10 times better than most companies' data centers when it comes to security, efficiencies, scalability, and uptime.
I think these large, publicly-traded cloud companies (Amazon, Verizon/Terremark, Rackspace, etc.) are like the jury system in the US - it isn't perfect, but it's the best there is.
Thanks for the update, Matt. What is important is the architecture of the environment from the redundancy and DR points of view. One would expect the production site fails over DR site when there is outage in production site. That should happen regardless of the nature of the outage; the site would not be considered a DR site if a natural disaster impacted both the production and DR sites.
In an era of increasingly severe and at times unpredictable weather, I find the story of Amazon's EC2 services to be just one of many. Look at the recent storm, which affected several million citizens in just one night from Indiana all the way through Ohio, West Virginia, Virginia, Maryland, DC, etc. It isn't just cloud-hosted solutions that are vulnerable to outages of this type. Non-cloud solutions are no safer at times.
A great example is point-of-sale systems, such as those used by so many grocers, retailers, gas stations, convenience stores, restaurants, etc. When the power goes out at many of these locations, the expectation is that it will quickly come back on. Yet as I, a West Virginian, look at the news, I cannot help but notice major chains like Kroger, Walmart, Food Lion, and Foodland, which serve a large area and, after two-plus days with little if any power, have had to dispose of millions of dollars of perishable products -- everything from meats, cheeses, refrigerated juices, prepared meals, deli meats, and side items to highly time-sensitive produce.
It saddens me that, in the effort to fine-tune profit lines, companies like these, which provide vital services, are caught just like the local populace, without any type of failover plan. At some stores, even things that would be useful and necessary in the short term, like candles, matches, propane, grills, charcoal, and bottled water, remained on shelves because stores had no failover plan for handling inventory when the computers and power were down. Retailers of all sorts turned away people without cash, as credit card and even check-verifying machines were unable to connect due to lack of power or downed telephone lines, leaving many desperate consumers stranded. Even when people had cash, many stores had insufficient cash reserves or could not process and sell merchandise without the scanner-based UPC lookup machines that so many retailers depend on.
What's worse? People looked far outside their normal shopping zones to shop at other stores of the chain, or even at competitors, who no doubt raked in cash hand over fist on simple commodities like generators, ice, bottled water, paper plates, cleaning supplies, and canned and other non-perishable items. Those retailers who had a plan not only positioned themselves to help the people seeking products, but likely increased their profit margins in the short term, which investors of any major retailer will no doubt appreciate.
For me at least, this is something that all companies need to address, whether they are cloud-related or not. When I read that retailer X or Y contemplated buying a generator at some point but thought it would be used too infrequently to justify its cost, I can't help but shake my head, as more money is walking out the door in dumpster bins than is going into their bank accounts. With freak ice storms, out-of-control wildfires, rainstorms, hurricanes, tornadoes, and other such weather phenomena becoming more common, if I were a CEO I'd be rethinking my failover plans at the local level for when connectivity and power go down.
Matt Heusser 7/3/2012 12:59:09 PM User Rank Blogger
Balance in reporting
My friend, Wayne Rash, pointed out in a post authored hours after mine went up that the situation was what the law might consider an "act of God", and that Amazon did everything it could have. In his words, Amazon's Data Center was "fully redundant in itself, and served by redundant backup power and redundant power grids, redundant network access went down under the combined onslaught of massive power outages, massive Internet outages, phone line outages and cell system outages. Not only did everything go down, but nobody could call for backup. And, of course, even if the staff had known that this event was happening, they couldn't have traveled there anyway. Most of the roads were blocked." Wayne goes on to point out that most failover/restore systems in North America aren't nearly as well prepared as Amazon's, and that, if we aren't on EC2, it may be time to look to ourselves before pointing fingers.
The man has a point, and I thought it was worth a brief follow-up to mention.