Storm in The Storage Cloud…And It Flooded My Office
For some strange reason I choose to work even when I’m not working and have what some could call two jobs (well, one real job and another job that supports itself, anyway). My day job is what you see here: helping to change the way people think about and implement virtualization in their data center. My moonlit weekend job that doesn’t quite pay any bills (yet) is professional photographer. To date, these two worlds haven’t had any relation or overlap at all (although I did take the main picture you see in the blog header, which is a shot of freshly installed data center racks, so maybe that counts). Last night, however, my separate professional lives collided in a storm I hadn’t witnessed before, and I felt rogue waves on both sides.
As has been widely reported, Amazon’s S3 service was down for a good while on Sunday, July 20th. I don’t personally or directly use their service (although I do know of individuals who are looking into it as a safe and secure backup system). However, I do use SmugMug as my back-end photo “store” and processing lab for the pro photog business and (as I learned on Monday) SmugMug uses S3 to store all of my valuable and (hopefully someday) bill-paying photography. I have my own local backup systems that I manage (more on that some other time) and I don’t rely on SmugMug as my content storage house, but I do rely on them to make my photography available for purchase (always available, always fast, and always secure). I don’t want to know what they use in their data center or how they manage and store my content; I only want to know that my content is safe and available. And all was good in the fields until Sunday evening, when S3 went down and took SmugMug (and all of the pro photographers they support) down with it (details available here).
So on Monday morning I began looking into the S3 outage for the Day Job and just happened to see that my Night Job was impacted by the outage, and that got my head spinning. It got me spinning primarily because this is the second outage that S3 has suffered in the past few months, and that’s a big deal for a lot of businesses beyond SmugMug. For most normal enterprise IT shops that keep their storage in-house, a critical outage that made dynamic data unavailable twice in such a short amount of time would cause the higher-ups to start asking questions about what, why, who, and how to make sure this never happens again. I imagine those types of questions are being asked at large-scale S3 customers, like SmugMug, all around the globe.
The other reason I got so spun up was the response, or lack thereof, from Amazon. As far as I can tell, the first reports came into their public forum from customers, in droves, reporting a “Service Unavailable” error message. Shouldn’t Amazon have known before the customers, and shouldn’t they have done a better job of notifying all their customers (beyond posting a green/yellow/red dot on a service page)? Does SmugMug really want to find out about a storage outage when they try to retrieve my galleries for a prospective customer, or would they prefer to know beforehand so they don’t let their app spin indefinitely? Or here’s a novel idea: perhaps Amazon should architect their storage service in an HA/DR manner so that a customer never sees a “Service Unavailable” message, or more importantly so that their service never goes down beyond a simple blip while service requests are redirected. Highly available data centers ain’t rocket science, and since Amazon is building VDCs like nobody’s business, perhaps they should already know this…
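To make the "redirect the request" idea a little more concrete, here is a minimal Python sketch of failover between storage backends. To be clear, this is not how Amazon or SmugMug actually work; the `StoreUnavailable` exception, the `fetch_with_failover` helper, and the two stand-in stores are all hypothetical names made up for illustration. The only point is that a caller can fall through to a replica instead of handing the customer an error.

```python
class StoreUnavailable(Exception):
    """Raised by a (hypothetical) storage backend that cannot serve a request."""


def fetch_with_failover(key, backends):
    """Try each (name, fetch) pair in order and return (name, data) from the
    first backend that answers. Raise StoreUnavailable only if all of them fail."""
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except StoreUnavailable as exc:
            # Remember why this backend failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise StoreUnavailable("all backends failed: " + "; ".join(errors))


# Stand-in backends (assumptions for the example, not real services).
def primary_store(key):
    # Simulates the outage: the primary always refuses the request.
    raise StoreUnavailable("503 Service Unavailable")


def replica_store(key):
    # Simulates a healthy replica that returns the object's bytes.
    return f"<bytes of {key}>".encode()


if __name__ == "__main__":
    source, data = fetch_with_failover(
        "gallery/1.jpg",
        [("primary", primary_store), ("replica", replica_store)],
    )
    print(f"served from {source}: {len(data)} bytes")
```

In a real system each fetch would also carry a timeout so a dead backend fails fast rather than leaving the app spinning, and the replica would have to live somewhere that does not share the primary's failure.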
I don’t want to be too short or critical here, but if anything, Amazon is blazing a trail in the Clouds on how not to build a production-class cloud service. The core requirement for offering a cloud service has got to be availability above everything else. Otherwise there’s no reason for a customer to trust the service with their mission-critical data. My Night Job customer persona is hoping that SmugMug is really sticking it to Amazon for taking them down (and at the same time making sure all their own eggs don’t fall off the tree when the S3 nest crashes again).
I think I’m going to write Amazon’s regular storefront customer service and ask for a credit in their MP3 download store to compensate for all the money I lost by not being able to sell my photographs while S3 was down. Think they’ll go for it?
