On Tuesday, February 28th, I logged into HubSpot and noticed that my cloud-hosted Workstate logo image in my email sig was broken. It wasn’t long before some of the development staff lit up Slack with messages like, “FYI: A lot of [Amazon] S3 resources are down, including a lot of github file hosting.”
After looking around a bit more, it was clear that Amazon Web Services was having a widespread problem with its US-East Region.
The disruption had IT departments scrambling, and rendered many internet services partially or completely unusable. It’s possible your own SaaS product or cloud solution was also impacted. Certainly AWS was experiencing problems – but guess what? If the service you own was down, it’s partly your fault.
Amazon’s outage was limited to one of their 16 geographic Regions throughout the world. They have six in North America alone. Most companies set up services in a single Region and use Availability Zones within that Region (distinct data centers or clusters of data centers) to keep their cloud services highly available at all times. If an entire AWS Region experiences problems, and your services are all hosted in that same Region, you will have service failures like those on February 28th.
You may be thinking that this is a major fault in cloud computing. Suddenly realizing you may have too much dependency on one cloud vendor could have you searching the web comparing Azure vs. AWS. Should you?
The answer is, probably not.
It’s tempting to blame all of your service interruptions on Amazon. But consider the fact that Amazon.com – certainly hosted on AWS – had no service interruption during the outage. This isn’t because Jeff Bezos has a top-secret deep Amazon web, or because AWS cares more about Amazon.com than your company. It’s because Amazon has decided that high availability means not only distributing across Availability Zones within a single Region, but also distributing their services across multiple Regions.
This is why your service failures are partly your fault. If your software architecture is not designed across multiple Regions then you are susceptible when a single Region has issues. I am by no means letting AWS off the hook for such a massive disruption, but you have a choice when building your services to define what “high availability” means to your company. If you choose to put all of your services in one Region, then you are subject to failures within the limits of any specific service’s Service Level Agreement (SLA).
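To make that design choice concrete, here is a minimal sketch of client-side failover across Regions. The endpoint URLs, the `REGION_ENDPOINTS` table, and the function names are all hypothetical, and in practice you would more likely handle this with DNS-level failover or a load balancer than in application code. The point is only that a request should have somewhere else to go when one Region is unhealthy.

```python
import urllib.request
from typing import Callable, Dict, Tuple

# Hypothetical per-Region endpoints for the same service (assumed names,
# not real URLs). Order matters: the first entry is the preferred Region.
REGION_ENDPOINTS: Dict[str, str] = {
    "us-east-1": "https://us-east-1.example.com/status",
    "us-west-2": "https://us-west-2.example.com/status",
}


class AllRegionsFailed(Exception):
    """Raised when no Region could serve the request."""


def http_fetch(url: str, timeout: float = 3.0) -> str:
    """Fetch a URL with a short timeout so a dead Region fails fast."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_with_failover(
    endpoints: Dict[str, str],
    fetch: Callable[[str], str] = http_fetch,
) -> Tuple[str, str]:
    """Try each Region in order; return (region, body) from the first success."""
    errors = {}
    for region, url in endpoints.items():
        try:
            return region, fetch(url)
        except OSError as exc:  # covers URLError, timeouts, connection errors
            errors[region] = exc
    # Every Region failed: surface all of the errors to the caller.
    raise AllRegionsFailed(errors)
```

Calling `fetch_with_failover(REGION_ENDPOINTS)` would try the first Region and fall through to the second only if the first raises a network error. If everything lives in one Region, there is no second entry to fall through to, which is exactly the situation described above.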
Here are a few things to do if your cloud applications were affected by the outage, or if you are wondering if a similar outage could impact you:
- Ask your cloud architect or cloud partner if you are taking advantage of multiple Availability Zones and multiple Regions. Decide how far to go based on your tolerance for an unlikely, but very disruptive, few hours of downtime.
- Prepare a contingency plan for loss of services. Cloud services are extremely dependable, but having an option for your end users during an outage will help avoid a crisis.
- Have a plan in place for recovering as quickly as possible from an outage. Do you have a way to validate that your services are running as normal after any sort of failure or disruption? Don’t let an hours-long outage turn into days of recovery. What is your process to validate continuity of service?
- Review your cloud architecture best practices, and if you need changes, put them on your roadmap.
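As one way to approach the validation question in the list above, here is a minimal sketch of a post-outage smoke test. The service names and the individual checks are hypothetical placeholders; a real check might request a health endpoint, run a test query, or read a known object back from storage.

```python
from typing import Callable, Dict, List


def validate_services(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check and return the names of services that failed.

    A check counts as failed if it returns False or raises an exception.
    """
    failed = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            # A check that blows up is treated the same as an unhealthy service.
            healthy = False
        if not healthy:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical checks; replace each lambda with a real probe.
    checks = {
        "web-frontend": lambda: True,
        "image-storage": lambda: False,
    }
    failed = validate_services(checks)
    if failed:
        print("Still unhealthy:", ", ".join(failed))
    else:
        print("All services validated")
```

Running a script like this after an incident gives you a repeatable answer to "are we back?" instead of relying on someone noticing a broken image.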
If you’re interested in learning more about Availability Zones, Regions, or cloud computing application development, please don’t hesitate to contact Workstate.