Cloud Is Not a Disaster Strategy

What recent regional disruptions remind us about resilience, locality, and architecture

Recent geopolitical tensions in the Middle East coincided with service disruptions across parts of a major hyperscale cloud platform. Public reporting indicated that more than one region and several availability zones experienced degradation during the same period. Situations like this are complicated and affect far more than technology systems, so the goal here is not speculation or blame. Operating cloud infrastructure at planetary scale is an extraordinary engineering challenge, and the reliability achieved by hyperscalers over the past decade is remarkable.

However, events like these highlight an architectural misconception that has quietly become common in many organizations. Somewhere along the way, the industry began to treat "moving to the cloud" as if it also meant "disaster recovery is handled". As long as several availability zones are involved, people feel confident enough to tick the box.

Unfortunately, this assumption was never part of the deal.

Cloud platforms provide extremely powerful building blocks, but they do not remove the responsibility to design recovery strategies. In many ways they make it easier to implement one, but they do not define it for you.

Understanding that distinction is where resilience actually begins.


Cloud abstracts infrastructure but not geography

Most modern cloud architectures are built around the idea of high availability. Applications are distributed across multiple availability zones, storage systems replicate data automatically, and load balancers reroute traffic when infrastructure components fail. These mechanisms work extremely well when the failure is limited to a hardware component, a rack, or even an entire datacenter.

But those protection mechanisms have a boundary, and that boundary is the region itself.

A cloud region is not just a logical construct. It is a physical environment that depends on power infrastructure, network backbones, routing agreements, and regulatory frameworks. When disruption affects a region as a whole, the redundancy inside that region still exists, but it can no longer provide continuity. This means that a system whose identity services, storage systems, container registries, CI pipelines, and workloads all reside within one region ultimately shares the same geographic risk.

Cloud platforms abstract hardware management and drastically reduce operational complexity. Geography, however, remains a real thing.


Data locality and concentration risk

Many organizations intentionally place their systems in a specific region because of data locality requirements. Keeping customer data within national borders is often required by regulation and is frequently the correct decision.

Usually, problems begin when locality becomes the only design principle. Systems then follow the comfortable path of regulatory simplicity, but architecture designed primarily for lawyers tends to overlook the technical failure domains engineers eventually have to deal with.

A typical pattern looks something like this: primary databases run in a single region, backups are stored in the same region, container images live in the regional registry, Kubernetes clusters depend on the regional control plane, and identity services are also hosted there. From a compliance perspective everything is perfectly aligned.

From a resilience perspective, the entire system now shares the same geographic impact radius. Compliance answers the question of where data is allowed to reside. Disaster recovery answers the question of where the system can continue operating if something goes wrong.

These questions overlap, but they are not identical. Treating them as the same architectural concern often leads to hidden concentration risk.
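The concentration risk described above is easy to check mechanically. A minimal sketch, assuming a hypothetical inventory that maps each critical component to the region it runs in (all names and regions here are illustrative, not taken from any real deployment):

```python
# Hypothetical inventory: component -> cloud region it lives in.
inventory = {
    "primary-database": "eu-central-1",
    "database-backups": "eu-central-1",
    "container-registry": "eu-central-1",
    "kubernetes-control-plane": "eu-central-1",
    "identity-provider": "eu-central-1",
}

def shared_region_risk(components: dict[str, str]) -> set[str]:
    """Return the region(s) whose loss would take down every component at once."""
    regions = set(components.values())
    # If all components resolve to a single region, that region is one
    # geographic failure domain for the entire system.
    return regions if len(regions) == 1 else set()

print(shared_region_risk(inventory))  # every component shares one region
```

A real version would read the inventory from infrastructure-as-code state rather than a hand-written dict, but the question it answers is the same: is there one place on the map that contains everything?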


Disaster recovery has always required verification

The challenge of recovery planning did not start with cloud computing.

In the early 2000s, shortly after my job training, I worked in environments that relied on automated tape backup systems. Every night the tape robot performed its routine with mechanical precision. Cartridges were loaded, rotated, archived, and the logs confirmed successful execution. From the outside the system looked perfectly healthy.

Until the day a disk system failed and we initiated the first real restore. The robot had performed its choreography flawlessly for months. The movements were precise. The logs were green. Everything suggested that the backup system was working. What nobody realized was that not a single byte had ever been written to the tapes. The robot was doing exactly what it had been configured to do. The process existed, but the validation never happened.

The lesson from that incident still applies today. Disaster strategies fail far more often because they were never exercised than because they were never written down.
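The tape lesson can be turned into code: a backup only counts once a restore has actually been performed and compared against the original. A minimal sketch using only the Python standard library, with throwaway data standing in for real backups:

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(source: Path, archive: Path) -> bool:
    """Back up `source` into `archive`, restore it to a scratch
    directory, and confirm every file survives byte-for-byte."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = Path(scratch) / source.name
        originals = [p for p in source.rglob("*") if p.is_file()]
        return all(
            checksum(p) == checksum(restored / p.relative_to(source))
            for p in originals
        )

# Demo with throwaway data instead of a production dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "orders.csv").write_text("id,amount\n1,42\n")
    print(verify_restore(data, Path(tmp) / "backup.tar.gz"))  # True
```

The tape robot described above would have failed this check on day one: the restore step, not the backup step, is what proves the system works.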


Cold standby as a practical recovery model

Not every organization needs a fully active secondary environment running in another region. In many cases that would introduce unnecessary cost and operational complexity. A cold standby model often provides a far more practical balance between resilience and efficiency. The idea is straightforward. Instead of running duplicate compute infrastructure continuously, the architecture ensures that the critical building blocks already exist outside the primary region.

In cloud environments this approach is particularly attractive because the underlying hardware capacity already exists. You are not buying and maintaining idle servers. Instead, you prepare the environment so that it can be recreated quickly when needed.

Infrastructure definitions should live in infrastructure-as-code tools such as Terraform so that networks, security policies, and compute resources can be provisioned automatically. Container images can be replicated into a secondary container registry, and storage systems can replicate snapshots or objects into another region. When recovery becomes necessary, compute resources are started from code and connected to the already replicated data.

Storage is relatively inexpensive compared to continuously running compute capacity, which means replicating images, backups, and snapshots usually adds minimal cost. If the system has been designed carefully, services can be restored within hours instead of days.
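The cost argument is easy to make concrete with back-of-envelope numbers. The prices and quantities below are illustrative assumptions, not quotes from any provider:

```python
# Illustrative monthly prices; real numbers vary by provider and region.
STORAGE_PER_GB = 0.02      # replicated object/snapshot storage, USD per GB
COMPUTE_PER_VM = 150.00    # one always-on application VM, USD per month

replicated_gb = 500        # images, backups, snapshots kept in the standby region
active_vms = 10            # fleet a hot standby would have to run continuously

cold_standby = replicated_gb * STORAGE_PER_GB           # storage only
hot_standby = cold_standby + active_vms * COMPUTE_PER_VM

print(f"cold standby: ${cold_standby:.2f}/month")   # $10.00
print(f"hot standby:  ${hot_standby:.2f}/month")    # $1510.00
```

Even with different numbers plugged in, the shape of the result rarely changes: the standby data is cheap, and the standby compute is what you save by provisioning it only when needed.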

The key requirement is reproducibility. Recovery must depend on code that has been executed before, not on documentation that someone hopes will work under pressure.


Kubernetes and the portability assumption

Containerization and Kubernetes have made application portability significantly easier, but they do not eliminate disaster recovery challenges. Managed Kubernetes clusters rely on regional control planes. Persistent volumes and load balancers are typically tied to regional infrastructure services. Identity integrations and network configurations also tend to depend on regional components.

This means that rebuilding a cluster in another region still requires preparation.

Cluster configuration should ideally be maintained through GitOps workflows so that new clusters can be recreated deterministically. Stateful services should replicate data across regions or export regular snapshots. Teams should occasionally rebuild clusters in another region to verify that the entire environment can actually be recreated without manual adjustments.
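One way to make the "no manual adjustments" check concrete is to diff the Git-defined desired state against what a rebuilt cluster actually serves. A simplified sketch with invented manifests, leaving out the details of talking to a real cluster:

```python
def diff_manifests(desired: dict, live: dict) -> dict:
    """Compare manifests keyed by 'Kind/name'; report gaps and drift."""
    missing = {k for k in desired if k not in live}
    drifted = {k for k in desired if k in live and desired[k] != live[k]}
    extra = {k for k in live if k not in desired}
    return {"missing": missing, "drifted": drifted, "extra": extra}

# Invented example: desired state from Git vs. a rebuilt cluster.
desired = {
    "Deployment/api": {"replicas": 3, "image": "registry-b/api:1.4"},
    "Service/api": {"port": 443},
}
live = {
    "Deployment/api": {"replicas": 3, "image": "registry-a/api:1.4"},  # manual tweak
}
print(diff_manifests(desired, live))
```

In a GitOps setup this comparison is exactly what the reconciliation controller does continuously; running it against a freshly rebuilt cluster in another region answers whether the Git repository really is the whole truth.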

Portability only becomes real once it has been exercised.


Preparation makes the difference

One example that received attention during the recent disruptions came from Careem. The company shared that it managed to migrate critical workloads to another geographic region in less than a day. Moves like that rarely happen spontaneously. They require infrastructure that can be reproduced quickly, data that already exists outside the primary region or can at least be moved fast, and operational teams that know exactly how the recovery procedure works.

From the outside such migrations appear extremely fast. In reality they are almost always the result of preparation that took months or years.

I am curious to learn more about how they approached the migration and will certainly keep an eye open for opportunities to hear their engineering teams speak about it at one of the regional meetups. Examples like this are valuable because they show what disciplined preparation can achieve.


Disaster recovery needs practice – Let the Chaos Monkey do the nasty work

One of the most common weaknesses in resilience strategies is the absence of drills. Architecture diagrams often look robust on paper, but real systems tend to reveal hidden dependencies only when they are tested under stress.

The situation is not very different from traditional safety drills in the physical world. Fire drills exist for a reason. Buildings can have perfectly designed evacuation plans, clearly marked escape routes, and compliant safety documentation. Yet it is not unusual for the first real drill to reveal something unexpected. Sometimes a door is locked, an exit is blocked by stored equipment, or a staircase that looked perfectly accessible on paper turns out to be difficult to use in practice. The plan itself may have been correct, but only the exercise exposed the gap between theory and reality.

Infrastructure behaves in a similar way. Systems that appear resilient in architecture diagrams often depend on hidden operational assumptions. A service may rely on a DNS provider that lives in the same region as everything else. A certificate authority might be reachable only through a network path that nobody considered critical. An identity provider might become unavailable precisely when systems need it most during recovery.

Controlled recovery exercises are one of the most effective ways to uncover these dependencies before a real incident does. Teams can simulate regional outages, deliberately disable infrastructure components, or temporarily isolate parts of the system to observe how applications react. The objective is not to create chaos for its own sake. It is to understand how the system behaves and how long recovery actually takes.
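A regional outage drill can be rehearsed on paper before it is run against real infrastructure. The sketch below simulates one over a small, invented dependency graph: a service survives only if it lives outside the failed region and everything it depends on survives too.

```python
# Hypothetical dependency graph: service -> (region, dependencies).
SERVICES = {
    "frontend": ("eu-west-1", ["api"]),
    "api":      ("eu-west-1", ["database", "auth"]),
    "database": ("eu-west-1", []),
    "auth":     ("eu-central-1", []),  # identity hosted elsewhere
}

def survivors(failed_region: str) -> set[str]:
    """Services still functional when `failed_region` goes dark."""
    def alive(name: str, seen=frozenset()) -> bool:
        region, deps = SERVICES[name]
        if region == failed_region or name in seen:
            return False
        # A service is only as available as its transitive dependencies.
        return all(alive(d, seen | {name}) for d in deps)
    return {s for s in SERVICES if alive(s)}

print(survivors("eu-west-1"))     # only "auth" survives
print(survivors("eu-central-1"))  # losing auth also takes down api and frontend
```

The second result is the kind of hidden dependency the text describes: a single component hosted elsewhere can still drag down everything that transitively relies on it.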

This idea became well known when Netflix introduced its Chaos Monkey tool as part of the Simian Army engineering practices. The principle was simple. Instead of waiting for failures to occur randomly, they deliberately introduced controlled failures into production environments to ensure that systems could tolerate them. Over time this approach evolved into a broader discipline known as chaos engineering. The underlying lesson remains relevant: resilience improves when systems are exposed to controlled failure conditions before uncontrolled ones appear.

Without exercises, disaster recovery plans remain theoretical. With them, organizations begin to understand how their systems actually behave under stress.


The architectural question that matters

Conversations about cloud reliability often drift into comparisons between providers or debates about infrastructure philosophy. In practice, those discussions rarely lead very far. A more useful starting point is to look directly at the system itself and ask a simple question.

What would actually happen if the region hosting your system became unavailable for seventy-two hours?

Which services would degrade gracefully and which would stop entirely? How long would it take to bring compute resources online elsewhere? How much data loss would be acceptable in contractual terms rather than theoretical ones?
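Those questions translate directly into two numbers: the recovery point objective (how much data may be lost) and the recovery time objective (how long recovery may take). A worked example with illustrative figures, not drawn from any real contract:

```python
# Illustrative figures; substitute your own schedule and contracts.
snapshot_interval_hours = 6   # cross-region snapshots every 6 hours
contractual_rpo_hours = 4     # maximum data loss promised to customers

recovery_steps_hours = {
    "provision infrastructure from code": 2.0,
    "restore data from replicated snapshots": 5.0,
    "redeploy and validate workloads": 3.0,
}
contractual_rto_hours = 12

worst_case_loss = snapshot_interval_hours      # data written since the last snapshot
total_recovery = sum(recovery_steps_hours.values())

print(f"RPO met: {worst_case_loss <= contractual_rpo_hours}")  # False: snapshots too sparse
print(f"RTO met: {total_recovery <= contractual_rto_hours}")   # True: 10h within 12h
```

In this invented scenario the recovery time is fine but the replication schedule quietly breaks the contract, which is exactly the kind of mismatch that only surfaces when someone does the arithmetic.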

Cloud platforms make it possible to build global systems faster than ever before. They remove a large portion of the operational burden and provide infrastructure capabilities that previously required entire teams to operate.

What they do not provide is a disaster strategy.

Designing recovery paths, defining acceptable loss, and validating that those mechanisms actually work still belongs to the engineers building the system. Infrastructure has been repeating the same lesson for decades, from tape robots to container platforms.

A recovery plan that has never been exercised usually works perfectly on paper. Right up to the moment it is needed.
