Disaster Recovery Strategies: From Backup Vaults to Always-On Resilience

Master the four DR strategies and their trade-offs in RTO, RPO, cost, and complexity to ace SAP-C02 scenario questions

Updated 2026-02-22

Overview

Disaster Recovery (DR) on AWS defines how quickly and completely you can restore operations after a failure, governed by two core metrics: Recovery Time Objective (RTO — how long you can be down) and Recovery Point Objective (RPO — how much data loss is acceptable). AWS formalizes four DR strategies on a spectrum from lowest cost/highest RTO to highest cost/lowest RTO: Backup & Restore → Pilot Light → Warm Standby → Multi-Site Active/Active. For SAP-C02, you must be able to map business requirements directly to the correct strategy, justify the cost/complexity tradeoff, and identify which AWS services implement each pattern.

SAP-C02 dedicates significant scenario weight to selecting the right DR strategy given constraints like budget, acceptable downtime, data loss tolerance, and compliance requirements — getting this wrong is a common differentiator between pass and fail.

Patterns & Strategies

Backup & Restore

The simplest and cheapest DR strategy. Data is backed up to durable storage (Amazon S3, S3 Glacier, AWS Backup) on a scheduled basis. In a disaster, infrastructure is provisioned from scratch using IaC (CloudFormation, CDK), and data is restored from backups. No standby infrastructure runs in the DR Region. RTO is measured in hours; RPO is determined by backup frequency (could be hours to days).

✓

Use when budget is the primary constraint, downtime of several hours is acceptable, and data loss of hours is tolerable. Ideal for dev/test environments, archival systems, or non-critical workloads. Also appropriate when compliance requires long-term data retention (e.g., S3 Glacier Vault Lock).

⚠

Lowest cost — you only pay for storage, not compute. However, RTO is the worst of all four strategies (hours to potentially a day+). Recovery is manual and error-prone without well-tested runbooks. RPO depends entirely on backup cadence — infrequent backups mean significant data loss.

Pilot Light

A minimal version of the critical core (the 'pilot light') is always running in the DR Region, typically just the database tier replicated in near-real-time (e.g., an RDS cross-Region read replica, Aurora Global Database, or DynamoDB Global Tables). Application servers and non-critical components are NOT running; they exist only as AMIs, Launch Templates, or CloudFormation templates ready to be activated. In a disaster, you scale out the pre-configured core and launch the application tier. RTO is measured in tens of minutes; RPO is near-zero for data (due to continuous replication), but application recovery still takes time.

✓

Use when data loss must be near-zero (critical transactional data) but some application downtime (tens of minutes) is acceptable. Good for production workloads where budget prevents full warm standby but data integrity is paramount — financial systems, order management.

⚠

Moderate cost — you pay for the always-on database replicas and minimal supporting infrastructure. Application tier costs are near-zero until failover. Complexity is moderate: requires automation (Auto Scaling, CloudFormation) to scale out quickly. The 'light' must be kept in sync with production config — configuration drift is a real operational risk.
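
The configuration-drift risk can be checked mechanically. The following is a minimal sketch, assuming you can export the settings that matter (AMI ID, instance type, engine version, and so on) from both Regions into flat dictionaries; the function name `find_drift` and the sample keys are hypothetical and not part of any AWS API.

```python
from typing import Any

MISSING = "<missing>"  # placeholder shown when a setting exists on one side only


def find_drift(primary: dict[str, Any], dr: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (primary_value, dr_value)} for every setting that differs."""
    drift = {}
    for key in sorted(primary.keys() | dr.keys()):
        primary_value = primary.get(key, MISSING)
        dr_value = dr.get(key, MISSING)
        if primary_value != dr_value:
            drift[key] = (primary_value, dr_value)
    return drift
```

An empty result means the two exports match. Running a check like this on a schedule is one way to notice that the DR Region still references an older AMI or engine version before a failover depends on it.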

Warm Standby

A scaled-down but fully functional version of the production environment runs continuously in the DR Region. All tiers (web, app, database) are live but at reduced capacity (e.g., smaller instance types, fewer instances). Data is continuously replicated. In a disaster, you scale the standby environment up to full production capacity with Auto Scaling and redirect traffic to it with Route 53 failover routing. RTO is measured in minutes; RPO is near-zero.

✓

Use when both RTO (minutes) and RPO (near-zero) are required but the cost of full active/active is prohibitive. Appropriate for business-critical applications where extended downtime causes significant revenue loss — e-commerce platforms, SaaS applications, internal ERP systems.

⚠

Higher cost than Pilot Light because all tiers run continuously (even at reduced capacity). Scaling up during a real disaster must be tested regularly — untested scale-up procedures are a common failure point. Can also serve limited read traffic in non-disaster periods to offset cost.
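
The scale-up step can be sketched as a small helper that computes the target size of the standby fleet. This is an illustration only: the function `scale_up_params`, the group name, and the headroom percentage are assumptions. The returned keys mirror the parameters of the Auto Scaling UpdateAutoScalingGroup API, but the sketch only builds the values and makes no AWS call.

```python
import math


def scale_up_params(group_name: str, standby_capacity: int,
                    production_capacity: int, headroom_percent: int = 0) -> dict:
    """Build the capacity settings that take a warm standby group to production size."""
    if production_capacity < standby_capacity:
        raise ValueError("production capacity must be at least the standby capacity")
    # Optional headroom absorbs the traffic surge that often follows a failover.
    desired = math.ceil(production_capacity * (100 + headroom_percent) / 100)
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": production_capacity,
        "DesiredCapacity": desired,
        "MaxSize": desired,
    }
```

For example, a standby group of 2 instances that must reach 10 with 20% headroom yields a desired capacity of 12. Rehearsing this step in game days is what turns "scale up in minutes" from an assumption into a measured number.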

Multi-Site Active/Active

The production workload runs simultaneously at full capacity in two or more AWS Regions. Traffic is distributed across all sites using Route 53 (latency-based, weighted, or geolocation routing), Global Accelerator, or CloudFront. Data is replicated across sites, typically asynchronously (Aurora Global Database, DynamoDB Global Tables). In a 'disaster,' traffic is simply shifted away from the impacted site, and users may experience no disruption at all. RTO approaches zero; RPO approaches zero.

✓

Use when zero (or near-zero) downtime and zero data loss are non-negotiable business requirements — banking, healthcare, mission-critical government systems, large-scale e-commerce (e.g., Black Friday). Also appropriate when global latency optimization is a secondary benefit.

⚠

Most expensive strategy — you pay for full production capacity in every site simultaneously. Architectural complexity is highest: requires stateless application design, distributed data consistency strategies, conflict resolution for writes, and sophisticated traffic management. Testing is complex because the 'DR' site IS production.
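
To make "traffic is simply shifted away" concrete, here is a toy model of weighted routing with health checks: unhealthy Regions are dropped and the remaining weights are renormalized. It is a simplified simulation under stated assumptions, not Route 53's implementation; the function name `shift_traffic` and the Region weights are hypothetical.

```python
def shift_traffic(weights: dict[str, int], unhealthy: set[str]) -> dict[str, float]:
    """Return each Region's share of traffic once unhealthy Regions are excluded."""
    healthy = {region: w for region, w in weights.items()
               if region not in unhealthy and w > 0}
    total = sum(healthy.values())
    if total == 0:
        raise RuntimeError("no healthy Region left to serve traffic")
    # Unhealthy Regions get a share of 0.0; healthy ones split traffic by weight.
    return {region: healthy.get(region, 0) / total for region in weights}
```

With two Regions weighted 50/50, losing one moves its share to the survivor. This is also why each site must be sized to carry the full load, which is what makes the strategy expensive.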

Multi-AZ (Regional High Availability — not cross-region DR)

While not one of the four formal DR strategies, Multi-AZ is a foundational availability pattern that is frequently tested alongside DR. Services like RDS Multi-AZ, ELB, and Auto Scaling across AZs provide high availability within a single Region. Failover is automatic and typically completes in 60–120 seconds for RDS. This protects against AZ-level failures but NOT Regional disasters.

✓

Use as the baseline for ANY production workload. Multi-AZ should always be enabled before considering cross-region DR strategies. It is NOT a substitute for cross-region DR when the requirement is Regional resilience.

⚠

Relatively low cost increment over single-AZ deployments. Automatic failover reduces operational burden. However, it does not protect against Region-wide events, data corruption, or accidental deletion — those require cross-region strategies.

AWS Elastic Disaster Recovery (DRS)

A managed AWS service (the successor to CloudEndure Disaster Recovery) that provides continuous block-level replication of servers to a staging area in AWS. In a disaster, machines can be launched in minutes with minimal data loss. Supports replication from on-premises, other clouds, or other AWS Regions. Eliminates the need to manually build Pilot Light or Warm Standby architectures for server-based workloads.

✓

Use when migrating from on-premises to AWS DR, or when you need managed continuous replication without building custom replication pipelines. Particularly effective for lift-and-shift DR scenarios where refactoring to cloud-native services is not yet complete.

⚠

Charged per replicated server per hour plus EBS storage for staging area. Simplifies DR implementation but introduces service dependency. Not a replacement for cloud-native DR patterns for greenfield AWS workloads.

Decision Framework

STEP 1 — Determine RTO requirement:

• RTO = hours acceptable → Backup & Restore is sufficient

• RTO = tens of minutes acceptable → Pilot Light

• RTO = minutes acceptable → Warm Standby

• RTO = near-zero / zero → Multi-Site Active/Active

STEP 2 — Validate against RPO requirement:

• RPO = hours acceptable → Backup & Restore (schedule backups accordingly)

• RPO = near-zero, but app downtime OK → Pilot Light (continuous DB replication)

• RPO = near-zero, app downtime in minutes OK → Warm Standby

• RPO = zero → Multi-Site Active/Active

NOTE: If RTO and RPO point to different strategies, always choose the MORE stringent (higher tier) strategy.
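
Steps 1 and 2, together with the "more stringent wins" rule, can be sketched as a small selector. The numeric thresholds (60, 10, and 1 minute) are illustrative assumptions chosen to match "hours", "tens of minutes", and "minutes"; they are not AWS-defined cutoffs, and the names `Strategy` and `select_strategy` are hypothetical.

```python
from enum import IntEnum


class Strategy(IntEnum):
    """DR tiers ordered from least to most stringent (and least to most costly)."""
    BACKUP_AND_RESTORE = 1
    PILOT_LIGHT = 2
    WARM_STANDBY = 3
    ACTIVE_ACTIVE = 4


def tier_for_rto(rto_minutes: float) -> Strategy:
    """Step 1: map acceptable downtime to the cheapest tier that meets it."""
    if rto_minutes >= 60:   # hours
        return Strategy.BACKUP_AND_RESTORE
    if rto_minutes >= 10:   # tens of minutes
        return Strategy.PILOT_LIGHT
    if rto_minutes >= 1:    # minutes
        return Strategy.WARM_STANDBY
    return Strategy.ACTIVE_ACTIVE


def tier_for_rpo(rpo_minutes: float) -> Strategy:
    """Step 2: map acceptable data loss to the cheapest tier that meets it."""
    if rpo_minutes >= 60:   # hours: scheduled backups are enough
        return Strategy.BACKUP_AND_RESTORE
    if rpo_minutes > 0:     # near-zero: continuous replication is required
        return Strategy.PILOT_LIGHT
    return Strategy.ACTIVE_ACTIVE  # strictly zero


def select_strategy(rto_minutes: float, rpo_minutes: float) -> Strategy:
    """When RTO and RPO disagree, the more stringent (higher) tier wins."""
    return max(tier_for_rto(rto_minutes), tier_for_rpo(rpo_minutes))
```

For instance, an RTO of 4 hours with an RPO of 1 minute resolves to Pilot Light: the RTO alone would allow Backup & Restore, but the RPO requires continuous replication.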

STEP 3 — Apply budget constraint as a filter:

• Lowest cost → Backup & Restore

• Moderate cost → Pilot Light

• Higher cost → Warm Standby

• Highest cost → Multi-Site Active/Active

NOTE: Budget can eliminate options but NEVER override RTO/RPO requirements in exam scenarios — the question will make one constraint dominant.

STEP 4 — Check for specific service signals in the question:

• 'Continuous replication' + 'database always running' → Pilot Light

• 'Scaled-down environment running' → Warm Standby

• 'Traffic split across regions' or 'Route 53 weighted routing' → Active/Active

• 'S3 cross-region replication' + 'CloudFormation' → Backup & Restore

• 'Aurora Global Database' or 'DynamoDB Global Tables' → Pilot Light or Active/Active

• 'AWS Elastic Disaster Recovery' → managed replication, typically Pilot Light equivalent

STEP 5 — Validate against compliance/data sovereignty if mentioned:

• Some regulations require data to remain in specific Regions — this may constrain which cross-region strategies are viable

• AWS Backup with cross-region copy can satisfy backup requirements while maintaining compliance

Exam Tips

Critical: RTO/RPO spectrum

RTO and RPO are the primary metrics that determine DR strategy selection on the exam; cost and compliance act as filters on top of them. Memorize the spectrum: Backup & Restore (highest RTO/RPO, lowest cost) → Pilot Light → Warm Standby → Active/Active (lowest RTO/RPO, highest cost). Every DR scenario question maps to this spectrum.

Critical: Pilot Light vs Warm Standby

Pilot Light and Warm Standby are the most commonly confused strategies. The KEY differentiator: Pilot Light has ONLY the data/core tier running (app servers are OFF, launched at failover). Warm Standby has ALL tiers running at REDUCED capacity (app servers ARE running, scaled up at failover). If the question says 'scaled-down environment,' it's Warm Standby.

Critical: Aurora Global Database, DynamoDB Global Tables

Aurora Global Database is the canonical exam answer for cross-region database replication in Pilot Light and Warm Standby scenarios. Replication lag is typically under 1 second (giving an RPO of about a second), and a secondary Region can be promoted in under 1 minute. DynamoDB Global Tables is the equivalent for NoSQL. Know which service maps to which DR tier.

Critical: Route 53 failover routing, Global Accelerator

Route 53 health checks combined with routing policies are the standard traffic-switching mechanism for cross-region DR strategies. For Active/Active, use weighted or latency-based routing with health checks. For active/passive (Pilot Light, Warm Standby), use failover routing with primary/secondary records. Global Accelerator is an alternative that provides faster failover (it uses static anycast IP addresses, so there are no DNS TTL delays).
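
As an illustration of the active/passive case, the sketch below builds a change batch with one PRIMARY and one SECONDARY failover record, in the shape accepted by the Route 53 ChangeResourceRecordSets API. The domain names, health check ID, TTL, and the helper name `failover_change_batch` are all hypothetical; the code only constructs the structure and does not call AWS.

```python
from typing import Optional


def failover_change_batch(record_name: str, primary_dns: str, secondary_dns: str,
                          primary_health_check_id: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch for a primary/secondary failover pair."""

    def record(role: str, target: str, health_check_id: Optional[str] = None) -> dict:
        rrset = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # a short TTL limits how long clients cache the old answer
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id is not None:
            # Route 53 answers with the secondary when this check reports unhealthy.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Active/passive failover records for DR",
        "Changes": [
            record("PRIMARY", primary_dns, primary_health_check_id),
            record("SECONDARY", secondary_dns),
        ],
    }
```

The TTL matters for RTO: clients keep using a cached answer until it expires, which is the DNS delay that Global Accelerator avoids.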

Critical: Multi-AZ vs Multi-Region

Multi-AZ is NOT a DR strategy — it is a high availability feature within a single Region. Questions that require protection against a 'regional outage' or 'regional disaster' require cross-region solutions. Multi-AZ alone is a wrong answer for any scenario involving regional failure.

Critical

Map RTO/RPO directly to strategy tier: hours→Backup & Restore, tens of minutes→Pilot Light, minutes→Warm Standby, near-zero→Active/Active. This single framework answers most DR scenario questions.

Critical

Pilot Light = ONLY data tier running (app servers OFF). Warm Standby = ALL tiers running at reduced capacity. This distinction is the most tested differentiator between the two middle strategies.

Critical

Multi-AZ ≠ DR. Multi-AZ protects against AZ failure within one Region. Any question involving 'regional outage' or 'entire region unavailable' requires a cross-region DR strategy.

Important: AWS Backup

AWS Backup with cross-region copy is the managed implementation of Backup & Restore. It supports EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, and more. For exam scenarios asking about centralized backup management across accounts and regions, AWS Backup with Organizations integration is the correct answer.

Important: Cost optimization in DR

When a question mentions 'cost optimization' alongside DR, look for opportunities to use the DR environment productively: Warm Standby can serve read traffic (RDS read replica), Pilot Light databases can serve analytics queries, Active/Active inherently uses all capacity. This is a common 'best of both worlds' scenario in SAP-C02.

Important: S3 CRR, S3 RTC

S3 Cross-Region Replication (CRR) is a key enabler of Backup & Restore and Pilot Light strategies. Combined with S3 Versioning (which replication requires on both buckets), it provides near-continuous data protection for object-based workloads. S3 Replication Time Control (RTC) is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA; use it when RPO must be bounded.
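
A replication rule with RTC enabled might look like the following sketch, which builds a configuration in the shape accepted by the S3 PutBucketReplication API. The role ARN, bucket ARN, rule ID, and the helper name `replication_config` are placeholders, and the code constructs the dictionary without calling AWS.

```python
def replication_config(role_arn: str, destination_bucket_arn: str,
                       rule_id: str = "dr-replication") -> dict:
    """Build an S3 replication configuration with Replication Time Control enabled."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # an empty filter applies the rule to the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": destination_bucket_arn,
                    # RTC: the 15-minute replication window
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    # Replication metrics must be enabled alongside RTC
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }
```

Leaving delete-marker replication disabled is a design choice in this sketch: a deletion in the source bucket is then not mirrored to the DR copy.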

Important: AWS Elastic Disaster Recovery

AWS Elastic Disaster Recovery (DRS) is the exam answer when the scenario involves on-premises servers or non-cloud-native workloads needing DR to AWS. It provides continuous block-level replication and point-in-time recovery. Do not confuse it with AWS Database Migration Service (DMS), which is for database migration, not DR.

Important: Active/Active data consistency

For Active/Active architectures, data consistency is the hardest problem. DynamoDB Global Tables uses last-writer-wins conflict resolution. Aurora Global Database uses a single primary writer (writes must go to the primary region — it is NOT truly active/active for writes unless you architect around it). Know these limitations for write-heavy workload scenarios.
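
The practical consequence of last-writer-wins is easiest to see in a toy simulation: when two Regions update the same item, the write with the later timestamp survives and the other is silently discarded. This is a conceptual model only, not DynamoDB's implementation; the `Write` type and the tie-break on Region name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    timestamp_ms: int
    value: str


def last_writer_wins(writes: list[Write]) -> Write:
    """Pick the surviving write for one item: latest timestamp wins.

    Ties are broken by Region name purely to keep this example deterministic.
    """
    if not writes:
        raise ValueError("at least one write is required")
    return max(writes, key=lambda w: (w.timestamp_ms, w.region))
```

If us-east-1 sets a quantity to 5 and eu-west-1 sets it to 7 five milliseconds later, the item ends up at 7 and the first update is lost without an error. Workloads that cannot tolerate this (counters, inventory decrements) need a different design, such as routing all writes for a given item to one Region.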

Good to know: CloudFormation StackSets

CloudFormation StackSets enable deploying DR infrastructure across multiple accounts and regions simultaneously — a critical enabler of Infrastructure as Code (IaC)-based DR. In exam scenarios about 'automating DR infrastructure deployment,' StackSets is frequently the correct answer.

Common Misconceptions & Traps

Common Mistake

Pilot Light means a small EC2 instance is always running in the DR region as a 'placeholder' for the application.

Correct

In Pilot Light, the APPLICATION tier is completely OFF (no running instances). Only the DATA tier (database, core replication) runs continuously. The application servers exist only as AMIs or Launch Templates and are launched only during failover. Running a small app instance would make it Warm Standby, not Pilot Light.

This is the #1 confusion between Pilot Light and Warm Standby on the exam. The 'pilot light' metaphor refers to a gas furnace — the tiny flame (data tier) is always on, ready to ignite the full system, but the furnace (app tier) is off. If you see 'application servers running at reduced capacity' → Warm Standby. If you see 'only database replication active' → Pilot Light.

Common Mistake

Multi-AZ deployments protect against regional disasters and can substitute for cross-region DR.

Correct

Multi-AZ provides availability within a SINGLE AWS Region across physically separated data centers. A regional event (natural disaster, major infrastructure failure affecting an entire Region) would impact all AZs in that Region simultaneously. Cross-region DR strategies (Pilot Light, Warm Standby, Active/Active) are required for regional resilience.

This misconception causes candidates to select Multi-AZ solutions for questions explicitly mentioning 'regional outage' or 'entire region becomes unavailable.' Always check whether the failure scenario is AZ-level (Multi-AZ solves it) or Region-level (requires cross-region DR).

Common Mistake

Active/Active DR means both regions handle writes simultaneously with no constraints.

Correct

True symmetric active/active with multi-master writes is architecturally complex and service-dependent. Aurora Global Database has a SINGLE primary writer Region; secondary Regions are read-only until one is promoted during failover (typically in under 1 minute), so writes are never accepted in two Regions at once. DynamoDB Global Tables supports multi-region writes with last-writer-wins conflict resolution. Stateless application layers CAN be truly active/active, but the database layer almost always has constraints.

Exam scenarios may present a 'gotcha' where a candidate selects Active/Active but the described architecture has a write-intensive workload that conflicts with the replication model. Understanding per-service limitations prevents selecting an infeasible architecture.

Common Mistake

A lower RPO always requires a more expensive DR strategy.

Correct

RPO is primarily about DATA replication frequency, not infrastructure tier. You can achieve near-zero RPO with Pilot Light (continuous database replication) while still having a non-trivial RTO (minutes to tens of minutes for app tier launch). The cost driver for higher tiers is primarily RTO improvement (keeping compute running), not RPO improvement.

This misconception leads candidates to over-engineer (and over-spend) when a question only requires near-zero RPO but can tolerate moderate RTO. Pilot Light with Aurora Global Database achieves near-zero RPO at much lower cost than Warm Standby or Active/Active.

Common Mistake

Backup & Restore is only relevant for archival/compliance use cases and is never the right answer for production workloads.

Correct

Backup & Restore is the correct answer for ANY production workload where the business explicitly accepts hours of RTO and hours of RPO, especially when cost minimization is the primary driver. Many legitimate production workloads (internal tools, batch processing systems, development/staging environments) have these characteristics. Never dismiss it as 'too simple' without checking the stated RTO/RPO requirements.

Candidates sometimes over-engineer DR solutions on the exam by defaulting to Warm Standby or Active/Active. If the question states 'RTO of 4 hours is acceptable' and 'minimize cost,' Backup & Restore is the correct and defensible answer.

Common Mistake

RTO and RPO are the same thing — both measure how quickly you recover from a disaster.

Correct

RTO (Recovery Time Objective) measures how long your system can be UNAVAILABLE — it's about time-to-restore service. RPO (Recovery Point Objective) measures how much DATA LOSS is acceptable — it's about the maximum age of data you can recover to. A system could have RTO=4 hours (service can be down for 4 hours) but RPO=1 hour (you cannot lose more than 1 hour of data), requiring frequent backups even with a slow recovery process.

Confusing RTO and RPO leads to selecting wrong DR strategies. Always identify both metrics independently in a scenario question before matching to a strategy. They can point to different tiers, and you must satisfy BOTH.
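
The independence of the two metrics shows up clearly when an incident is scored against both. The sketch below is a minimal illustration; the function name `evaluate_recovery` and the timestamps in the usage note are made up for the example.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recovery_point: datetime, failure_time: datetime,
                      service_restored: datetime, rto: timedelta, rpo: timedelta) -> dict:
    """Score one incident against its RTO and RPO, each measured independently."""
    data_loss = failure_time - last_recovery_point   # what RPO constrains
    downtime = service_restored - failure_time       # what RTO constrains
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }
```

With RTO = 4 hours and RPO = 1 hour, a failure at 12:00 that is restored at 15:30 meets the RTO either way. Whether the RPO is met depends only on the last recovery point: 11:15 passes (45 minutes lost), while 10:00 fails (2 hours lost).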

Memory Tricks

🧠

DR Cost Ladder (cheapest to most expensive): 'Brave Pilots Warm the Active fire' → Backup & Restore → Pilot Light → Warm Standby → Active/Active

🧠

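RTO vs RPO: think of RTO as 'Return To Operations' (how long until you're back up) and RPO as 'Return Point Objective' (how far back in time you can go for data). These are memory aids only; the actual terms are Recovery Time Objective and Recovery Point Objective.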

🧠

Pilot Light vs Warm Standby: 'Pilot = only the PILOT (data/core) is running. Warm = the whole engine is WARM (all tiers running, just smaller).'

🧠

Active/Active signals in exam questions: look for 'Route 53 weighted routing,' 'traffic split,' 'both regions serving production traffic,' 'zero RTO,' or 'zero RPO' — any of these strongly suggest Active/Active.

🧠

The Four DR Strategies as disaster preparedness analogies: Backup & Restore = 'evacuation plan only, rebuild from scratch'; Pilot Light = 'generator fueled, waiting to start'; Warm Standby = 'backup facility open but understaffed'; Active/Active = 'two fully staffed offices, both operational'

Common Trap

Selecting Warm Standby when the question describes Pilot Light (or vice versa) — specifically, assuming that because a database is 'always running' in the DR region, the strategy must be Warm Standby. The correct differentiator is whether the APPLICATION TIER is also running (Warm Standby) or completely off/unprovisioned (Pilot Light). Always look for what is running in the DR region, not just what is replicated.

CertAI Tutor · SAP-C02 · 2026-02-22
