B-17 during World War II

Resilience by Design: Why Uptime Starts with Planning

In World War II, the B-17 flew many of the daylight bombing missions in enemy airspace. These were among the most dangerous missions of the war, and many B-17s did not make it back. Some made it back "on a wing and a prayer," as many aircrews described it.

These planes made it back and the crews lived to fight another day because the aircraft was incredibly resilient. Where a few shots would take out other types of planes, these could take hit after hit and still keep flying. The planes would lose engine after engine, ailerons, rudder, and the crews would still make it back to a bumpy, but survivable, landing.

We could use more of those design philosophies and operational ideals these days. With the AWS and Azure outages of the last few weeks, we saw how seemingly minor issues could cripple large portions of the economy for long periods of time. None of this happened because the systems couldn't be resilient. It happened because people did not consider resiliency in their designs. They put all their eggs in one basket, believing that companies as large as Amazon and Microsoft would never have problems.

As a product of very large institutions, I can tell you that broad and deep resiliency is probably out of reach for many small businesses. Your cloud service bills would double or triple. The administrative burden of finding properly resilient vendors would be more than you could ever estimate. Fortunately for small businesses, the need for this kind of split-second recovery is rarely there. There are still things you can do to lessen the effects of vendor outages, natural disasters, human error, and other hiccups.

First and foremost, you need to understand your uptime need. Can you be down for an hour? A day? A week? Do you need some number of nines in your uptime (99.9% availability, for example, is "three nines")? If you don't understand your true need, you will either overbuild or underbuild the back end. Work with your cybersecurity and IT pros to determine that number and what's reasonable for your business and your budget.
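To make the "nines" concrete, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the sample percentages are illustrative assumptions, not figures from any vendor's service agreement.

```python
# Illustrative sketch: translate an availability target ("nines") into
# the downtime it permits over one year. Names and sample targets are
# assumptions chosen for this example.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by a target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours down per year")
```

Working backward is often more useful: if your business can tolerate a full day of downtime a few times a year, you likely do not need to pay for three or four nines.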

If you are just trying to reduce the impact of these major vendor outages, there are other, less demanding things you can do as well:

  • Don't put your cloud-based systems in the default data centers. AWS us-east-1 is commonly the default region. That's why there's so much there. People just fail to explore other options.
  • Design backup processes around outages. Is Office 365 down? How can you continue working via phone calls or Zoom meetings? Can you pivot to less communicative tasks or projects to effectively use the time?
  • Ask your SaaS vendors about their cloud resilience strategy. I'm sure the major SaaS vendors will be getting those questions a lot over the next few weeks and months. How does QuickBooks survive an outage? Your ERP? Your EHR?
  • For those who host internal infrastructure, evaluate your backup battery capacity. If the time available is too short, consider increasing capacity. At a minimum, have an adequate backup battery on your internet infrastructure (e.g., modem, router, wireless network devices).
  • If you are still reliant on internal infrastructure, consider a move to a cloud provider. It doesn't have to be all or nothing, but consider placing your most used and needed systems at a cloud provider (just not in us-east-1!). Even with the recent issues, cloud platforms are more resilient than doing it yourself on-premises.
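For the backup battery check above, a rough estimate is often enough to tell whether your capacity is in the right range. The sketch below is a back-of-the-envelope calculation only: the function name, the 0.8 efficiency factor, and the sample wattages are assumptions for illustration, and real runtime depends on battery age, temperature, and the manufacturer's own runtime charts.

```python
# Back-of-the-envelope sketch: estimate how long a backup battery can
# carry a given load. All numbers here are hypothetical examples.


def estimate_runtime_minutes(
    battery_watt_hours: float,
    load_watts: float,
    efficiency: float = 0.8,  # assumed conversion losses; varies by unit
) -> float:
    """Return the approximate runtime in minutes for a steady load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    usable_watt_hours = battery_watt_hours * efficiency
    return usable_watt_hours / load_watts * 60


if __name__ == "__main__":
    # Hypothetical load: modem + router + one access point drawing 25 W
    minutes = estimate_runtime_minutes(battery_watt_hours=100, load_watts=25)
    print(f"Estimated runtime: {minutes:.0f} minutes")
```

Compare the estimate with the outage length you decided you can tolerate. If the battery covers only a fraction of that window, that is the signal to increase capacity.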

None of these are silver bullets and you should always work with experienced professionals to determine your needs and capabilities before making any moves. Let us know if you'd like some help.

Reference:
https://fortune.com/2025/10/21/why-did-amazon-web-services-fail-virginia-data-center-outage-internet/