
Master the architectural patterns that keep AWS workloads running through failures, disasters, and surges — and ace every resilience question on cert exams.
High Availability (HA) ensures a system remains operational with minimal downtime, typically expressed as a percentage uptime (e.g., 99.99%). Fault Tolerance goes further — the system continues operating correctly even when one or more components fail, with zero perceptible impact to users. Understanding the distinction, the AWS services that enable each pattern, and the trade-offs between cost and resilience is essential for every AWS certification from Cloud Practitioner through Professional and Specialty levels.
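The uptime percentages behind HA follow directly from probability: with independent redundant components, overall availability is one minus the chance that all of them fail at once. A minimal sketch (illustrative math only, not an AWS tool; the 99.9% single-instance figure is an assumed example):

```python
# Illustrative availability math: the probability that at least one of
# n independent redundant components is up at any moment.
def composite_availability(single: float, n: int) -> float:
    """Availability of n redundant components running in parallel."""
    return 1 - (1 - single) ** n

def downtime_minutes_per_year(availability: float) -> float:
    """Convert an availability fraction into expected yearly downtime."""
    return (1 - availability) * 365 * 24 * 60

single = 0.999  # one instance at "three nines" (assumed example figure)
two_az = composite_availability(single, 2)

print(f"1 instance: {single:.6f} -> {downtime_minutes_per_year(single):.0f} min/yr down")
print(f"2 AZs     : {two_az:.6f} -> {downtime_minutes_per_year(two_az):.2f} min/yr down")
```

Two redundant copies of a 99.9% component yield roughly 99.9999% in this idealized model; real systems fall short of that because failures are not fully independent, which is why testing (covered later) matters.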
Certification exams heavily test your ability to select the right resilience pattern for a given scenario: distinguishing HA from FT, active-active from active-passive, and pilot light from warm standby. Confusing these patterns is among the most common ways candidates lose points on Solutions Architect exams.
Multi-AZ Deployment (Active-Active / Active-Passive)
Resources are deployed across two or more Availability Zones within a single AWS Region. In active-active mode, all instances serve traffic simultaneously (e.g., ALB distributing to EC2 in AZ-a and AZ-b). In active-passive mode, a standby resource is promoted only on failure (e.g., RDS Multi-AZ — standby is NOT readable). AZs are physically separate facilities (each comprising one or more data centers) with independent power, cooling, and networking, connected via low-latency links.
Use Multi-AZ for any production workload requiring protection against a single AZ failure. It is the baseline HA pattern for databases (RDS, ElastiCache), compute (EC2 with ASG), and load balancers (ALB/NLB are inherently multi-AZ). Required whenever the exam asks for 'high availability within a region.'
Approximately 2x cost for standby resources. Failover for RDS Multi-AZ takes 60–120 seconds. Standby RDS instance cannot serve read traffic — use Read Replicas separately for read scaling.
Multi-Region Active-Active
The application runs simultaneously in two or more AWS Regions, with traffic routed via Route 53 (latency-based, geolocation, or weighted routing). Data is replicated bidirectionally (e.g., DynamoDB Global Tables, Aurora Global Database). Any region can serve reads AND writes. RTO and RPO approach zero for regional failures.
Use when the business cannot tolerate any regional outage and requires near-zero RTO/RPO. Ideal for globally distributed applications, financial systems, and gaming platforms. Also chosen when global latency reduction is a primary requirement alongside resilience.
Highest cost and complexity. Requires conflict-resolution strategies for bidirectional writes. Aurora Global Database has a typical replication lag of under 1 second. DynamoDB Global Tables use last-writer-wins conflict resolution.
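Last-writer-wins can be sketched in a few lines. This toy model (plain dictionaries and an assumed `ts` timestamp field, not the DynamoDB API) shows why bidirectional replication needs a conflict-resolution rule and what LWW silently discards:

```python
# Toy model of last-writer-wins (LWW) conflict resolution, the model
# DynamoDB Global Tables apply when two regions write the same item.
# The "ts" timestamp field is illustrative; the real service tracks
# write times internally.
def lww_merge(version_a: dict, version_b: dict) -> dict:
    """Return whichever version carries the later write timestamp."""
    return version_a if version_a["ts"] >= version_b["ts"] else version_b

# Concurrent writes to the same item from two regions:
us_east = {"item": "user#42", "name": "Ada",   "ts": 1700000005}
eu_west = {"item": "user#42", "name": "Grace", "ts": 1700000009}

winner = lww_merge(us_east, eu_west)
print(winner["name"])  # later write wins; the earlier write is lost
```

The key takeaway for exams: LWW is simple and automatic, but the losing write disappears without error, so active-active designs must tolerate that or partition writes by region.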
Multi-Region Active-Passive (Pilot Light)
A minimal version of the production environment runs continuously in a secondary region — only the core, critical components (e.g., database replication, core AMIs) are kept 'lit.' On disaster, the secondary region is scaled up rapidly. Route 53 health checks detect failure and redirect DNS. RTO is measured in tens of minutes.
Use when cost must be minimized but recovery must be faster than a cold backup restore. Appropriate for workloads with RTO requirements of 10–60 minutes and RPO of minutes. The exam often presents this as the answer when 'low cost DR with faster recovery than backup-restore' is the requirement.
Not zero-downtime. Scaling up the secondary region takes time. Database replication (e.g., Aurora Global Database read replica promotion) must be planned and tested. Risk of 'pilot light going out' if replication is not monitored.
Multi-Region Active-Passive (Warm Standby)
A scaled-down but fully functional version of the production stack runs in the secondary region at all times. Unlike pilot light, all application tiers are running (web, app, DB). On failover, the secondary region is scaled to full production capacity. RTO is minutes. Route 53 health checks trigger DNS failover automatically.
Use when RTO must be under 15 minutes and cost is secondary to recovery speed. The standby environment can also serve reduced traffic during normal operations (e.g., internal testing, canary deployments). The exam uses this when 'faster than pilot light but cheaper than active-active' is the stated goal.
Higher cost than pilot light (full stack running at reduced capacity). Requires careful capacity planning to ensure scale-up completes within RTO. Database write lag from primary must be monitored.
Backup and Restore (Cold Standby)
Data is backed up regularly to S3, AWS Backup, or snapshots (EBS, RDS). In a disaster, infrastructure is rebuilt from IaC (CloudFormation, CDK) and data is restored from backups. No resources run in the DR region during normal operations. RTO is hours. RPO depends on backup frequency.
Use for non-critical workloads where cost minimization is paramount and hours of downtime are acceptable. Also used as a supplemental strategy alongside other patterns (e.g., always have backups even if you also run warm standby). Exam: choose this when the scenario says 'lowest cost DR' with no strict RTO requirement.
Highest RTO (hours) and potentially high RPO if backups are infrequent. Restoration must be regularly tested — untested backups are a common exam trap. S3 versioning + Cross-Region Replication can reduce RPO for data.
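The RPO math for backup-and-restore is worth internalizing: everything written after the last successful backup is lost. A minimal sketch with assumed example timestamps:

```python
# Worst-case data loss for backup-and-restore: all writes after the
# last successful backup are unrecoverable. Timestamps are examples.
from datetime import datetime, timedelta

def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """RPO actually realized: time between last backup and the failure."""
    return failure - last_backup

last_backup = datetime(2026, 2, 22, 0, 0)    # nightly snapshot at midnight
failure     = datetime(2026, 2, 22, 18, 30)  # disaster strikes at 18:30

print(data_loss_window(last_backup, failure))  # 18:30:00 of lost writes
```

This is why the worst-case RPO equals the backup interval: with nightly backups, up to 24 hours of data can be lost, which is the figure to compare against the stated RPO requirement in exam scenarios.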
Auto Scaling Groups (ASG) with Health Checks
ASGs automatically replace unhealthy EC2 instances and scale capacity based on demand. Health checks can be EC2-level (instance status) or ELB-level (application-level HTTP checks). ELB health checks are more accurate — they detect application failures, not just instance failures. ASGs span multiple AZs for fault tolerance.
Use for any stateless compute tier. ASG is the foundational building block for HA compute on EC2. Combine with ALB for traffic distribution and ELB health checks for application-aware replacement. The exam almost always includes ASG in any HA EC2 architecture.
Scaling activities have a cooldown period. Instance replacement takes time (bootstrap, warm-up). For stateful workloads, session persistence must be handled externally (ElastiCache, DynamoDB).
Elastic Load Balancing (ELB) — ALB / NLB / GWLB
ELB distributes incoming traffic across multiple targets (EC2, containers, Lambda, IPs) in multiple AZs. ALB operates at Layer 7 (HTTP/HTTPS, path-based routing, host-based routing). NLB operates at Layer 4 (TCP/UDP, ultra-low latency, static IP, preserves source IP). GWLB operates at Layer 3 for inline network appliances. All ELB types are inherently highly available and managed by AWS.
Use ALB for web applications, microservices, and container workloads. Use NLB when you need static IPs, ultra-low latency, or non-HTTP protocols. Use GWLB when routing traffic through third-party firewalls or IDS/IPS appliances. ELB is required in almost every HA architecture on the exam.
ALB has slightly higher latency than NLB. NLB does not support path-based routing. Cross-zone load balancing is enabled by default on ALB (no extra charge) but disabled by default on NLB (charged when enabled).
Route 53 Health Checks and DNS Failover
Route 53 monitors endpoint health (HTTP, HTTPS, TCP) and automatically updates DNS routing when endpoints become unhealthy. Supports failover routing (primary/secondary), weighted routing, latency-based routing, and geolocation routing. Health checks can monitor CloudWatch alarms (for private resources). TTL must be set low (e.g., 60s) for fast failover.
Use for multi-region failover, global traffic distribution, and blue/green deployments. Route 53 is the DNS layer of every multi-region HA architecture. The exam tests when to use each routing policy — failover for DR, latency-based for global performance, weighted for gradual traffic shifting.
DNS TTL caching means failover is not instantaneous — clients may cache old records. The standard health check interval is 30 seconds; fast health checks poll every 10 seconds. DNS propagation adds latency to failover. Not a replacement for application-level load balancing.
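These components combine into a back-of-envelope worst-case failover estimate: detection time (check interval times the consecutive-failure threshold) plus the client-side TTL cache. A sketch using illustrative default values, not guaranteed figures:

```python
# Back-of-envelope worst-case Route 53 DNS failover time:
# detection (interval x consecutive failures needed) + client TTL cache.
# Values below are illustrative defaults, not guarantees.
def worst_case_failover_seconds(interval: int, failure_threshold: int,
                                ttl: int) -> int:
    """Seconds until the last cached client reaches the new endpoint."""
    return interval * failure_threshold + ttl

standard = worst_case_failover_seconds(interval=30, failure_threshold=3, ttl=60)
fast     = worst_case_failover_seconds(interval=10, failure_threshold=3, ttl=60)

print(f"standard checks: ~{standard}s, fast checks: ~{fast}s")  # ~150s vs ~90s
```

Even with fast health checks and a low TTL, DNS failover is measured in minutes, which is exactly why seconds-level RTO scenarios point to active-active architectures instead.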
SQS-Based Decoupling (Asynchronous Fault Isolation)
Using Amazon SQS between application tiers decouples producers from consumers. If the consumer tier fails, messages queue up and are processed when the tier recovers — no data loss. SQS Standard offers at-least-once delivery; SQS FIFO offers exactly-once, ordered delivery. Dead Letter Queues (DLQ) capture messages that fail processing repeatedly.
Use whenever a downstream component might be unavailable or slower than the upstream producer. Classic pattern: EC2/Lambda writes to SQS → ASG of workers polls SQS → workers process and write to DB. Exam: SQS is the answer when 'decouple' or 'handle traffic spikes without losing messages' appears in the scenario.
Introduces asynchronous processing — not suitable for workloads requiring synchronous responses. FIFO queues have throughput limits. Message retention maximum is 14 days. Visibility timeout must be tuned to prevent duplicate processing.
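The redrive behavior behind DLQs can be simulated in a few lines. This sketch models queues as plain lists (not real SQS) with an assumed maxReceiveCount of 3, showing how a poison-pill message gets quarantined instead of blocking the queue forever:

```python
# Sketch of SQS redrive-policy behavior: after maxReceiveCount failed
# receives, a message moves to the dead-letter queue instead of being
# retried forever. Queues are plain Python lists here, not real SQS.
MAX_RECEIVE_COUNT = 3  # assumed redrive policy setting

def process_with_dlq(messages, handler):
    main_queue = [(m, 0) for m in messages]  # (message, receive_count)
    dlq, done = [], []
    while main_queue:
        msg, count = main_queue.pop(0)
        try:
            handler(msg)
            done.append(msg)                 # processed and deleted
        except Exception:
            count += 1
            if count >= MAX_RECEIVE_COUNT:
                dlq.append(msg)              # poison pill quarantined
            else:
                # visibility timeout expires -> message reappears
                main_queue.append((msg, count))
    return done, dlq

def handler(msg):
    if msg == "poison":
        raise ValueError("cannot parse message")

done, dlq = process_with_dlq(["ok-1", "poison", "ok-2"], handler)
print(done, dlq)  # ['ok-1', 'ok-2'] ['poison']
```

Note that the healthy messages complete even while the poison pill is retried; without the DLQ branch, the loop above would never terminate, which is the "retried indefinitely" failure mode the exam describes.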
Chaos Engineering and Gameday Testing
Proactively injecting failures into production or staging environments to validate that HA/FT mechanisms work as designed. AWS Fault Injection Service (FIS) enables controlled experiments (terminate instances, throttle APIs, inject latency). Gamedays simulate real disaster scenarios with defined RTO/RPO targets.
Use to validate DR plans, measure actual RTO/RPO, and find hidden single points of failure before a real incident does. Required for Well-Architected Framework Reliability Pillar compliance. The exam may reference FIS in reliability and operational excellence questions.
Requires mature operational practices. Risk of unintended production impact if blast radius is not controlled. Requires executive support and well-defined rollback procedures.
STEP 1 — Determine RTO/RPO requirements:
• RTO = 0, RPO = 0 → Multi-Region Active-Active (DynamoDB Global Tables, Aurora Global DB, Route 53 latency routing)
• RTO < 15 min, RPO < 1 min → Multi-Region Warm Standby
• RTO 10–60 min, RPO minutes → Multi-Region Pilot Light
• RTO hours, RPO hours → Backup and Restore (lowest cost)
STEP 2 — Determine scope of failure protection:
• Protect against AZ failure only → Multi-AZ deployment (RDS Multi-AZ, ASG across AZs, ALB)
• Protect against regional failure → Multi-Region pattern (choose from Step 1)
• Protect against component failure → ASG health checks, ELB, SQS decoupling
STEP 3 — Identify the bottleneck layer:
• Compute → ASG + ALB + Multi-AZ
• Database reads → RDS Read Replicas or Aurora Replicas
• Database writes + HA → RDS Multi-AZ (synchronous replication, automatic failover)
• Database global → Aurora Global Database or DynamoDB Global Tables
• DNS/traffic routing → Route 53 with health checks
• Async workloads → SQS + DLQ
STEP 4 — Validate cost constraints:
• Cost is primary constraint → Backup/Restore or Pilot Light
• Balanced cost/recovery → Warm Standby
• Cost is not a constraint → Active-Active
STEP 5 — Confirm with Well-Architected Reliability Pillar:
• Is there a single point of failure? Eliminate it.
• Are health checks application-level (ELB), not just infrastructure-level (EC2)?
• Is the DR plan tested? (Untested = non-compliant)
• Are backups automated and cross-region?
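The Step 1 thresholds can be captured as a small decision function. This is a hypothetical helper that mirrors the cutoffs listed above; real designs also weigh cost (Step 4) and failure scope (Step 2) before committing:

```python
# Hypothetical helper mapping RTO/RPO requirements to the four DR
# strategies, using the Step 1 thresholds from the decision guide.
# Cutoffs mirror the text; real designs also weigh cost and scope.
def pick_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    if rto_minutes == 0 and rpo_minutes == 0:
        return "Multi-Region Active-Active"
    if rto_minutes < 15 and rpo_minutes < 1:
        return "Warm Standby"
    if rto_minutes <= 60:
        return "Pilot Light"
    return "Backup and Restore"

print(pick_dr_strategy(0, 0))      # Multi-Region Active-Active
print(pick_dr_strategy(10, 0.5))   # Warm Standby
print(pick_dr_strategy(45, 5))     # Pilot Light
print(pick_dr_strategy(240, 720))  # Backup and Restore
```

Working through a scenario's stated RTO/RPO against thresholds like these is exactly the mapping exercise the exam expects; the function's ordering also encodes the cost gradient, since each earlier branch is a more expensive strategy.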
RDS Multi-AZ standby is NOT readable. It exists solely for failover. If you need read scaling, add Read Replicas separately. Exam questions frequently offer Multi-AZ as a distractor when the scenario asks for 'read scaling' — always pick Read Replicas for reads.
High Availability ≠ Fault Tolerance. HA means 'minimal downtime with brief interruption possible.' FT means 'zero interruption — system continues operating through failures.' If the exam says 'no interruption to users even during failure,' the answer requires FT patterns (active-active, redundant components), not just HA.
ELB health checks are superior to EC2 health checks in ASGs for application-layer fault detection. Always configure ASG to use ELB health checks in production architectures. The exam tests this: if an application crashes but the EC2 instance is still running, EC2 health checks will NOT trigger replacement — ELB health checks will.
Pilot Light vs Warm Standby: Pilot Light = only core components running (like a pilot light on a furnace — minimal, not functional without scaling up). Warm Standby = full stack running at reduced capacity (functional immediately, just needs scaling up). The exam uses these terms precisely — do not confuse them.
RDS Multi-AZ standby is NEVER readable — it is for failover only. Read Replicas are for read scaling. These are separate features that must be combined for both HA and read scaling.
HA ≠ FT: High Availability tolerates brief interruptions with fast recovery. Fault Tolerance means zero interruption — the system keeps running through failure. Match the pattern to the stated RTO: seconds = FT (active-active), minutes = HA (Multi-AZ failover).
DR Strategy order by cost and RTO (cheapest/slowest → most expensive/fastest): Backup & Restore → Pilot Light → Warm Standby → Active-Active. The exam always tests your ability to map a given RTO/RPO requirement to the correct strategy.
Route 53 DNS TTL is critical for failover speed. Low TTL (e.g., 60 seconds) means DNS changes propagate faster. High TTL means clients cache the old record longer. The exam may ask why failover is slow — the answer is often 'TTL is set too high.' Always set low TTL for HA DNS configurations.
Cross-zone load balancing behavior differs by ELB type: ALB has it ENABLED by default at no extra charge. NLB and GWLB have it DISABLED by default and charge for inter-AZ data transfer when enabled. This is a frequent exam detail in cost optimization + HA questions.
SQS Dead Letter Queues (DLQ) are a fault tolerance mechanism, not just a debugging tool. They prevent poison-pill messages from blocking queue processing indefinitely. The exam tests DLQ as the answer when 'a failed message is blocking other messages from being processed' or 'messages are being retried indefinitely.'
Aurora Global Database can promote a secondary region to primary in under 1 minute (typically). Aurora Multi-AZ failover within a region is under 30 seconds. RDS Multi-AZ failover is 60–120 seconds. Know these relative speeds for RTO comparison questions.
Availability Zones within a Region are connected via redundant, high-bandwidth, low-latency links — but they are physically separate facilities. This means synchronous replication between AZs is feasible (RDS Multi-AZ uses synchronous replication). Cross-region replication is always asynchronous due to physical distance.
The Well-Architected Reliability Pillar's #1 principle is 'Test recovery procedures.' An untested DR plan is considered non-compliant. AWS Fault Injection Service (FIS) is the AWS-native tool for chaos engineering. Exam questions about 'validating resilience' point to FIS.
Common Mistake
Multi-AZ RDS provides both high availability AND read scaling.
Correct
RDS Multi-AZ provides HA only. The standby instance is completely passive — it cannot serve any traffic, including reads. To scale reads, you must separately create Read Replicas (which are NOT automatically promoted on primary failure in standard RDS).
This is the single most tested RDS misconception on Solutions Architect exams. The exam will describe a scenario needing both HA and read scaling, and the correct answer is Multi-AZ PLUS Read Replicas — not Multi-AZ alone.
Common Mistake
High Availability and Fault Tolerance mean the same thing — both keep systems running.
Correct
HA allows brief, acceptable downtime (measured in seconds to minutes) and is achieved through redundancy with failover. FT means zero downtime — the system continues operating correctly even when components fail, through active redundancy (e.g., active-active with no failover needed). FT is more expensive and complex.
Exam scenarios that say 'users must experience no interruption' require FT patterns. Scenarios that say 'minimize downtime' or 'quickly recover' require HA patterns. Choosing HA when FT is required, or FT when HA is sufficient, will cost you points.
Common Mistake
Route 53 failover is instantaneous when a health check detects failure.
Correct
Route 53 failover is NOT instantaneous. Health checks have a polling interval (minimum 10 seconds for fast health checks, 30 seconds standard). After detection, DNS must propagate — and clients may cache the old record for the duration of the TTL. Total failover time can be minutes, not seconds.
Candidates assume DNS failover is as fast as ELB failover. For near-zero RTO, you need application-level failover (e.g., active-active with no failover needed), not DNS-based failover. Exam questions about 'seconds-level RTO' should not rely on Route 53 DNS failover alone.
Common Mistake
Pilot Light is the same as Warm Standby — both have resources running in the DR region.
Correct
Pilot Light has ONLY the minimum core components running (e.g., database replication, base AMIs registered) — the application servers are NOT running and must be launched and scaled during recovery. Warm Standby has the FULL application stack running at reduced capacity — it can serve traffic immediately, just needs scaling up.
The exam uses these terms with surgical precision. Pilot Light has a longer RTO than Warm Standby because of the launch time for application servers. If a question says 'the DR environment must be able to serve traffic immediately,' Warm Standby is correct, not Pilot Light.
Common Mistake
An Auto Scaling Group with EC2 health checks will automatically replace instances where the application has crashed.
Correct
EC2 health checks only detect if the underlying EC2 instance is running and reachable. If the application process crashes but the OS is still up, the EC2 health check reports the instance as healthy. Only ELB health checks (configured on the ASG) will detect application-level failures and trigger instance replacement.
This is a critical operational gap. Many architects configure ASGs without switching to ELB health checks, creating a false sense of fault tolerance. The exam tests this specifically — always use ELB health checks in ASGs for true application-level fault tolerance.
Common Mistake
Adding more Availability Zones always improves availability linearly.
Correct
While spreading across more AZs reduces the probability of simultaneous AZ failure, the relationship is not linear and there are diminishing returns. More critically, adding AZs increases complexity, inter-AZ data transfer costs, and the risk of split-brain scenarios in distributed systems. Two AZs vs three AZs is a nuanced trade-off, not a simple 'more is always better' decision.
Exam questions about cost optimization may have a correct answer of reducing AZ count when the scenario doesn't justify the cost. Understanding that AZ distribution has trade-offs prevents over-engineering answers.
BPWA = Backup, Pilot light, Warm standby, Active-active — the four DR strategies in order from LOWEST cost/HIGHEST RTO to HIGHEST cost/LOWEST RTO. 'Big Purple Whales Are expensive' — cost and complexity increase left to right.
HA vs FT: HA = 'Hold on, Almost back' (brief interruption, then recovery). FT = 'Fault? Totally fine' (no interruption, continues operating).
RDS Multi-AZ: The standby is 'Shy' — it Never Reads, Never Writes, just Waits. Add Read Replicas for reading.
ELB Health Checks > EC2 Health Checks: 'ELB checks if your App is Alive; EC2 checks if the Box is Alive.' You care about the App.
Route 53 Routing Policies — FLGWM: Failover, Latency, Geolocation, Weighted, Multivalue. 'Friendly Llamas Go With Me' — remember the five main routing policies.
Selecting RDS Multi-AZ as the solution for both high availability AND read performance — Multi-AZ only provides failover HA; the standby is completely non-readable. You must add Read Replicas separately for read scaling, making the correct answer 'Multi-AZ + Read Replicas,' not 'Multi-AZ alone.'
CertAI Tutor · 2026-02-22