
Multi-Cloud Redundancy: The Gap in Our Resilience Thinking


Published on: 19th Mar, 2026 by Amitav Roy
Multi-region is not enough. Recent outages exposed the gap between region and provider redundancy.

Recent cloud outages made me rethink something I thought we had solved.

We had multi-region deployments. We had failover logic. We had redundancy drawn into architecture diagrams. And then I watched AWS take down entire data centres — backups included — and realised none of that actually mattered in the way I thought it did.

Because we were still on one provider.


The Mental Model We Got Wrong

Most resilience thinking in the industry has evolved around a single axis: geography.

If one region fails, traffic routes to another. If one availability zone has an issue, your replicas in a second AZ take over. This is good engineering. It solves a real class of problems.

But it doesn't solve the class of problems where the provider itself has an incident.

A cloud provider's control plane failure, a DDoS at their edge layer, or a data centre power event can cascade across every region they operate. Your beautifully orchestrated multi-region setup collapses — because all the regions are still managed by the same control plane, share the same DNS resolution path, and sit behind the same CDN infrastructure.

Region redundancy ≠ Provider redundancy.

These are two different failure domains. We've been optimising for one and ignoring the other.


Why Cloudflare-Class Outages Are Different

When your application server goes down, you fail over. When your database replica lags, you promote the secondary. These are well-understood failure modes with well-understood recovery paths.

Cloudflare-class outages operate at a different layer. Cloudflare isn't just a CDN — it's DNS resolution, DDoS protection, TLS termination, and routing logic for a massive portion of internet traffic.

When that layer has an incident:

  • Your app is running fine — but users can't reach it
  • DNS queries fail or resolve incorrectly
  • TLS handshakes break
  • Your monitoring may also be affected — because it routes through the same infrastructure

You can't fail over your way out of this. There's no replica to promote. The failure isn't in your infrastructure — it's in the layer beneath it.

This is the category of incident that exposes provider-level risk in the starkest way possible.


What Happened With the AWS Data Centre Incidents

The AWS incidents that caught my attention weren't just service degradations. Some affected entire physical data centres in ways that made backups unavailable alongside primary infrastructure.

This breaks a fundamental assumption most teams make: that backups stored in the same provider's object storage are independent from primary compute failures.

They're not. If the provider has a region-level or storage-level incident, your S3 backups, your RDS snapshots, your EBS volumes — all of it sits in the same blast radius.

For teams running on a single cloud provider:

  • Primary compute: AWS
  • Database: AWS RDS
  • Backups: AWS S3 (different region, same provider)

An AWS-wide incident collapses all three simultaneously. The architecture looks redundant on paper. The failure domain is a single entity.


What Multi-Provider Redundancy Actually Looks Like

This is not about rewriting your entire infrastructure. Most teams can introduce meaningful provider redundancy at specific layers without a full multi-cloud migration.

1. DNS-layer redundancy

Split your DNS resolution across two providers. Use a primary and secondary nameserver from different vendors. If one provider has a DNS incident, queries still resolve through the other.

Primary   -> Cloudflare DNS
Secondary -> Route 53 or another independent DNS provider

This is low-effort and high-impact for a specific but critical failure mode.
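The split is also easy to sanity-check from a zone's NS records. A minimal Python sketch, using illustrative hostname patterns for Cloudflare and Route 53 nameservers (a real audit would query the live NS set rather than hardcode it):

```python
def provider_of(nameserver: str) -> str:
    """Classify a nameserver hostname by provider (heuristic, illustrative)."""
    host = nameserver.lower().rstrip(".")
    if host.endswith(".ns.cloudflare.com"):
        return "Cloudflare"
    if ".awsdns-" in host:
        return "Route 53"
    return "unknown"


def is_provider_redundant(nameservers) -> bool:
    """True if the NS set spans at least two known, distinct providers."""
    providers = {provider_of(ns) for ns in nameservers} - {"unknown"}
    return len(providers) >= 2


# Redundant: two independent vendors answer for the zone.
print(is_provider_redundant(["ada.ns.cloudflare.com", "ns-2048.awsdns-64.com"]))
# Not redundant: both nameservers belong to the same vendor.
print(is_provider_redundant(["ada.ns.cloudflare.com", "bob.ns.cloudflare.com"]))
```

In practice you'd pull the live NS set with a resolver library such as dnspython and fail the check in CI if the set ever collapses back to a single vendor.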

2. CDN failover

Maintain a secondary CDN configuration. Use health-check-based routing at the DNS level to switch traffic if your primary CDN becomes unresponsive. Most DNS providers support this natively.

The secondary CDN doesn't need to be warm for everything — start with your most critical paths.
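The routing decision itself is simple. A sketch of the logic a DNS-level health check applies, with the failure threshold and hostnames as placeholder assumptions:

```python
FAILURE_THRESHOLD = 3  # assumed: consecutive failed probes before switching


def choose_cdn(probe_history,
               primary="cdn-primary.example.com",
               secondary="cdn-secondary.example.com"):
    """Pick which CDN hostname DNS should answer with.

    probe_history: most-recent-last list of booleans (True = probe passed).
    Serve the primary until it fails FAILURE_THRESHOLD probes in a row.
    """
    recent = probe_history[-FAILURE_THRESHOLD:]
    if len(recent) == FAILURE_THRESHOLD and not any(recent):
        return secondary  # sustained failure: route to the standby CDN
    return primary
```

Most managed DNS health checks implement a variant of this; the point of writing it out is that the switch depends only on probe results, not on anything inside the failing provider.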

3. Compute warm standby

For revenue-critical services, maintain a warm standby on a second provider. This doesn't mean active-active across two clouds — that's operationally expensive. A warm standby with automated failover for your highest-risk paths is a more realistic starting point.

Primary: AWS (active)
Secondary: GCP or Azure (warm, receives no traffic until failover triggers)
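To avoid flapping between clouds, the trigger should require sustained failure rather than a single bad probe. A minimal sketch of that decision, with the 5-minute window as an assumed threshold and failback left as a deliberate manual step:

```python
import time

FAILOVER_AFTER_SECONDS = 300  # assumed: require 5 minutes of sustained failure


class StandbyController:
    """Tracks primary health and decides when to promote the warm standby.

    Illustrative sketch: a real controller would flip DNS records or a
    load-balancer target; here we only model the decision.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.first_failure = None
        self.active = "primary"

    def report(self, healthy: bool) -> str:
        now = self.clock()
        if healthy:
            self.first_failure = None  # streak broken; reset the timer
        elif self.first_failure is None:
            self.first_failure = now  # start of a failure streak
        elif (now - self.first_failure >= FAILOVER_AFTER_SECONDS
              and self.active == "primary"):
            self.active = "secondary"  # sustained outage: promote standby
        return self.active
```

Note that recovery does not automatically route traffic back: failing back to the primary is usually a considered, human-triggered step, not something to automate on the first healthy probe.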

4. Backup independence

This is the one to change immediately after reading about the AWS data centre incidents.

Your backups must live on a different provider than your primary infrastructure. Not S3 in us-east-1 and us-west-2. S3 in AWS and object storage in a second provider.

Primary compute: AWS
Backups: Cloudflare R2, Google Cloud Storage, or Backblaze B2

The backup storage choice here matters less than the provider independence.
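The rule is easy to encode as a check. A sketch that audits a hypothetical backup plan (the dict shape is an assumption for illustration) for provider independence:

```python
def independent_backups(plan: dict) -> bool:
    """True if every backup destination lives on a different provider
    than the primary compute. `plan` shape is illustrative:
    {"primary_provider": str, "backups": [{"provider": str, ...}, ...]}
    """
    primary = plan["primary_provider"]
    backups = plan["backups"]
    return bool(backups) and all(
        dest["provider"] != primary for dest in backups
    )


# Same-provider "redundancy": S3 in two regions still shares one blast radius.
same_provider = {
    "primary_provider": "aws",
    "backups": [{"provider": "aws", "location": "s3://backups-us-west-2"}],
}

# Provider-independent: primary on AWS, backups on a second vendor.
cross_provider = {
    "primary_provider": "aws",
    "backups": [{"provider": "backblaze", "location": "b2://backups"}],
}
```

Conveniently, R2, Backblaze B2, and Google Cloud Storage all expose S3-compatible endpoints, so existing backup tooling can often be pointed at a second provider by changing the endpoint URL and credentials rather than rewriting the tooling itself.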


The Trade-offs Are Real

I'm not going to pretend this is straightforward. Multi-provider redundancy introduces genuine complexity:

  • Operational overhead increases — IAM policies, credentials, tooling, and observability now span multiple providers
  • Cost goes up — warm standby capacity on a second provider isn't free, and you're paying for infrastructure that handles no production traffic under normal conditions
  • Testing failover is harder — cross-provider failover testing is more complex than cross-region failover within a single provider
  • Data consistency requires careful design — if your standby needs to be in sync, replication across provider boundaries adds latency and cost

Not every service justifies this. A low-traffic internal tool doesn't need multi-provider redundancy. But for systems where an outage has direct revenue or compliance consequences, the math usually works.


A Practical Starting Point

You don't need to solve all of this at once. A reasonable progression:

  1. Audit your backup strategy first — move backups to a provider-independent location. This is the lowest-effort change with the highest risk reduction.
  2. Add DNS redundancy — configure a secondary nameserver on a different provider.
  3. Map your critical paths — identify which services, if unavailable for 4 hours, would cause the most damage. Start the warm standby work there.
  4. Test your failover — multi-provider failover that has never been tested is not a failover plan. It's a hope.

👉 We've spent years solving "what if a region fails?" We also need to be solving "what if a provider fails?"

Recent events made that gap visible in a way that architecture diagrams never did. The assumption that multi-region equals resilient is comfortable — and for many common failure modes, it's correct. But it leaves an entire failure domain unaddressed.

Have any recent outages changed how your team thinks about cloud provider dependency? And how have you approached the trade-off between operational complexity and provider-level redundancy?


Visual Ideas

1. The Two-Axis Diagram A 2x2 grid. X-axis: geographic redundancy (low to high). Y-axis: provider redundancy (low to high). Most teams are plotted in the bottom-right quadrant — high geographic redundancy, low provider redundancy. The "actually resilient" zone is the top-right. Label real architecture patterns in each quadrant.

2. The Blast Radius Map A series of concentric circles. Innermost: AZ failure. Middle: Region failure. Outer: Provider failure. Most redundancy investments are in the inner two circles. The outer circle is empty. Annotate the real events that fall into each ring.

3. The Backup Independence Diagram Two architecture stacks side by side. Left: all components (compute, DB, backups) inside one cloud provider box — connected by a single red "blast radius" boundary. Right: compute and DB in provider A, backups in provider B — the blast radius from provider A stops at the boundary and doesn't reach the backups.

4. The Dependency Chain A vertical chain diagram showing layers: Application -> Compute -> DNS -> CDN -> Physical Data Centre. Annotate each layer with which provider owns it. For a typical AWS + Cloudflare setup, show how a Cloudflare incident severs the chain at a layer above the application, making application-level failover irrelevant.


LinkedIn Announcement

Recent cloud outages changed how I think about resilience.

We had multi-region. We had failovers. We thought we had redundancy covered.

But all of it sat inside one provider.

Region redundancy and provider redundancy are two different failure domains. Most teams have optimised heavily for one and ignored the other entirely.

Wrote about what that gap looks like, why Cloudflare-class outages are a different category of problem, and what a practical starting point for multi-provider redundancy looks like.

Link in comments.

How has your team thought about cloud provider dependency — and has any recent outage changed that thinking?