On June 29, 2012, at about 11:20 p.m. Eastern, several corners of the Internet went dark, including Reddit.com, which according to Alexa's engine ranks 61st in Website traffic in the United States.
At the exact same time, power went out in several states along the Atlantic Seaboard, including northern Virginia, where Amazon.com has a large datacenter.
No, it's no coincidence. The sites that the datacenter served up went down, and they went down hard. My colleague, Robin Miller, wrote a blog post detailing how a few people may have lost their jobs, and noting that if you kept yours, the word of the week is "failover." 21st Century IT's own Alison Diana posted that using a single Elastic Compute Cloud (EC2) instance isn't enough -- companies need to use the availability zone features to manage failure.
That's true -- at least to a point. People who ask Amazon to manage their servers are told, in the fine print, that if a datacenter goes out, their software will go down, unless the company uses the tools Amazon offers for high availability.
And, just like RAID-5, those tools add cost: redundancy is more expensive than, say, running with a single disk drive.
What people don't talk about is how much.
Consider Netflix, which has an explicit, public strategy to use those very same availability zone features to prevent downtime.
Except that, according to Forbes, Netflix fell down just as hard as the rest; it just managed to recover in 90 minutes. Most other sites were back within two hours.
How much did that extra half hour cost?
To get some insight into the time and energy it takes to develop the infrastructure Netflix has, you might look at their help wanted pages for engineering. Specifically, look at all the job titles with "cloud" or "reliability" in them.
Remember when we thought cloud computing was going to make things easier? This company has 10 open positions to grow the team of programmers working to coordinate its cloud efforts.
That does not mean programmers working on features for the company, designing streaming video, archiving content, or processing payments. No, that is simply building the platform to make cloud computing possible.
And those are just the open slots.
Again, I ask -- remember when we thought cloud computing would make things easier?
It was supposed to be a grid that you plug into to make computing power available like a light bulb, or a television.
It's great that Amazon has some tools to enable remote servers. Even better that we can get pre-configured compute environments. For some Websites that are free to use and advertiser-funded, like Google, Facebook, Twitter, and Reddit, that might just be good enough.
For the rest of us, cloud computing is no free lunch. We'll continue to have work to do.
The grid isn't here yet, at least not as it was originally sold to us. It takes too much work; the standards and compatibility don't exist (try switching from IBM to HP to Amazon); and real high availability requires redundancy, coordination of servers, and a small army of technical staff.
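To make "redundancy and coordination" a little more concrete, here is a minimal Python sketch of the failover idea: try one zone, and if it is dark, move on to the next. Everything in it is an assumption made for illustration. The zone labels, `fetch_with_failover`, `fake_fetch`, and `AllZonesDown` are hypothetical names, the "network call" is simulated, and none of this is Amazon's API or a description of how Netflix actually does it.

```python
from typing import Callable, Sequence


class AllZonesDown(Exception):
    """Raised when every zone in the list failed to answer (hypothetical)."""


def fetch_with_failover(zones: Sequence[str], fetch: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors = []
    for zone in zones:
        try:
            return fetch(zone)
        except ConnectionError as exc:
            # This zone is unreachable; remember why and try the next one.
            errors.append(f"{zone}: {exc}")
    raise AllZonesDown("; ".join(errors))


def fake_fetch(zone: str) -> str:
    # Hypothetical stand-in for a real network call.
    # We pretend the first zone has lost power.
    if zone == "us-east-1a":
        raise ConnectionError("power failure")
    return f"200 OK from {zone}"


if __name__ == "__main__":
    # The first zone fails, so the answer comes from the second.
    print(fetch_with_failover(["us-east-1a", "us-east-1b"], fake_fetch))
```

Even this toy version hints at where the work goes: someone has to run the second zone, keep it in sync with the first, and pay for it whether or not the lights ever go out.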
Still, we continue to get closer bit by bit.
What do you think the next inch will be?