Event-Driven Architecture on AWS: Decouple, Scale, Conquer

Master the patterns, services, and decision logic that power modern serverless systems — and dominate certification exams.

Updated 2026-02-22

Overview

Event-driven architecture (EDA) is a design paradigm where services communicate by producing and consuming events, enabling loose coupling, independent scaling, and asynchronous processing. On AWS, EDA is foundational to serverless workloads and spans services like EventBridge, SNS, SQS, Kinesis, and Lambda. Understanding when to use each service — and why — is one of the highest-yield topics across the Solutions Architect, Developer, and SysOps certification tracks.

Exams test your ability to select the correct EDA service or pattern for a given scenario, particularly around ordering, durability, fan-out, replay, throughput, and latency requirements. Confusing these trade-offs is a common way to lose points on architecture scenario questions.

Patterns & Strategies

Pub/Sub Fan-Out (SNS → SQS / Lambda)

A producer publishes a single message to an SNS topic. Multiple subscribers (SQS queues, Lambda functions, HTTP endpoints, email) each receive their own copy of the message independently and simultaneously. This decouples the producer from every downstream consumer.

✓

When one event must trigger multiple independent downstream actions — e.g., an order placed event that simultaneously notifies inventory, billing, and shipping services. Also use when you need to fan out to different processing pipelines.

⚠

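SNS is not a queue: it retries delivery according to its delivery policy but does not store messages for consumers to pull later. Adding SQS between SNS and consumers (SNS → SQS → Lambda) provides durability and retry. Without SQS, SNS invokes Lambda asynchronously, so no event source mapping is involved. If the function keeps failing, the event is discarded once retries are exhausted unless a DLQ or on-failure destination is configured on the Lambda function, or a DLQ (redrive policy) is configured on the SNS subscription.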
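A minimal sketch of the wiring behind SNS → SQS fan-out, assuming Python and building only the request payloads rather than calling AWS. The ARNs and the helper names `queue_policy_for_topic` and `subscription_params` are hypothetical; the returned values are the kind you would pass to boto3's `set_queue_attributes` (as the `Policy` attribute) and `subscribe`.

```python
import json

# Hypothetical ARNs used only for illustration.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-events"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:billing-queue"


def queue_policy_for_topic(queue_arn: str, topic_arn: str) -> str:
    """Return a queue access policy that lets one SNS topic send to one queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Restrict senders to this specific topic.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    return json.dumps(policy)


def subscription_params(topic_arn: str, queue_arn: str) -> dict:
    """Return keyword arguments for subscribing the queue to the topic."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        # Deliver the published body as-is instead of the SNS JSON envelope.
        "Attributes": {"RawMessageDelivery": "true"},
    }
```

Each downstream service (inventory, billing, shipping) would get its own queue and its own subscription, so a slow or failing consumer only backs up its own queue.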

Message Queuing with Decoupled Processing (SQS)

Producers write messages to an SQS queue. Consumers poll the queue and process messages independently. Messages are retained until a consumer deletes them after successful processing or the queue's retention period expires (4 days by default, configurable up to 14 days). Standard queues offer at-least-once delivery with best-effort ordering; FIFO queues provide exactly-once processing and strict ordering within a message group.

✓

When you need to buffer workloads, smooth traffic spikes, or ensure no message is lost even if the consumer is temporarily unavailable. Use FIFO when strict ordering or exactly-once semantics are required (e.g., financial transactions, inventory updates).

⚠

SQS consumers must poll (pull model), which adds slight latency vs. push-based models. FIFO throughput is limited compared to Standard queues. Visibility timeout must be tuned carefully — too short causes duplicate processing, too long delays retry of failed messages.

Event Bus Routing (Amazon EventBridge)

Events are published to an EventBridge event bus. Rules evaluate each event against patterns and route matching events to one or more targets (Lambda, SQS, SNS, Step Functions, API Gateway, etc.). EventBridge supports schema registry, event replay, and cross-account/cross-region event routing.

✓

When you need content-based routing (route events based on their payload attributes), SaaS integration (Salesforce, Zendesk, etc. publish directly to EventBridge), scheduled events (cron/rate rules), or cross-account event architectures. EventBridge is the preferred modern EDA backbone on AWS.

⚠

EventBridge has a slightly higher per-event cost than SNS for high-volume use cases. It is not designed for ultra-high-throughput streaming (use Kinesis for that). Event delivery is at-least-once; consumers must be idempotent.
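To make content-based routing concrete, here is a small Python sketch of an event pattern and a deliberately simplified local matcher. The pattern uses the real EventBridge shape (field names mapping to lists of allowed values), but `matches` is a hypothetical helper that only handles exact-value matching; it is not the EventBridge rule engine, which also supports prefix, numeric, and other operators. The source name and field values are made up.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern field must be present in the event.

    A list in the pattern means "any of these values"; a nested dict is
    matched recursively against the corresponding nested event object.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Route only paid orders from a (hypothetical) orders service.
PAID_ORDER_PATTERN = {
    "source": ["com.example.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {"status": ["PAID"]},
}

paid_event = {
    "source": "com.example.orders",
    "detail-type": "OrderPlaced",
    "detail": {"status": "PAID", "orderId": "o-1"},
}
pending_event = {
    "source": "com.example.orders",
    "detail-type": "OrderPlaced",
    "detail": {"status": "PENDING", "orderId": "o-2"},
}
```

Here `matches(PAID_ORDER_PATTERN, paid_event)` is true and the pending event is rejected, which is the behavior a rule with this pattern would have for these two events.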

Streaming with Ordered, Replayable Events (Amazon Kinesis Data Streams)

Producers write records to shards in a Kinesis Data Stream. Multiple consumer applications can independently read from the same stream at their own pace using shard iterators or Enhanced Fan-Out. Records are retained for a configurable period (default 24 hours, up to 365 days), enabling replay. Ordering is guaranteed within a shard.

✓

When you need real-time, ordered, replayable event streams — e.g., clickstream analytics, IoT telemetry, log aggregation, or any scenario requiring multiple independent consumers reading the same data at different speeds. Use when throughput is high and ordering matters.

⚠

Kinesis requires shard management (provisioned mode) or costs scale with throughput (on-demand mode). Not suitable for simple task queuing — SQS is better for that. Enhanced Fan-Out adds cost but eliminates consumer read throttling.
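The per-shard ordering guarantee follows from how records are placed: Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The Python sketch below illustrates that idea under the assumption that the hash space is split evenly across shards (which may not hold after resharding); `shard_for` is a hypothetical helper, not an AWS API.

```python
import hashlib


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2**128 // shard_count
    # Clamp so the top of the hash space falls in the last shard.
    return min(hash_key // range_size, shard_count - 1)


# Every record for the same device lands on the same shard,
# so its records are read back in the order they were written.
first = shard_for("device-42", 4)
second = shard_for("device-42", 4)
```

Because the mapping is deterministic, choosing a key such as a device ID or order ID keeps related records ordered, while a high-cardinality key spreads load across shards.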

Choreography vs. Orchestration

Choreography: Each service reacts to events independently with no central coordinator — services publish events and others subscribe. Orchestration: A central coordinator (AWS Step Functions) explicitly calls each service in sequence or parallel and manages state, retries, and error handling.

✓

Use choreography (pure EDA) for loosely coupled, independently deployable microservices where no single service needs global workflow visibility. Use orchestration (Step Functions) when you need complex workflow logic, long-running processes, human approval steps, or centralized error handling and visibility.

⚠

Choreography is harder to debug and trace end-to-end (use AWS X-Ray + EventBridge event tracing). Orchestration with Step Functions adds a central point of control but can create tighter coupling to the workflow definition. Step Functions Express Workflows are cost-effective for high-volume, short-duration workflows.
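For contrast with choreography, this is roughly what centralized retries and error handling look like in an orchestrated workflow. The Python sketch builds an Amazon States Language definition as a dict; the state names and Lambda ARNs are hypothetical, and `dangling_targets` is an illustrative helper for checking the definition locally, not part of any AWS SDK.

```python
import json

# Hypothetical Lambda ARN prefix used only for illustration.
FN = "arn:aws:lambda:us-east-1:123456789012:function:"

definition = {
    "Comment": "Order workflow sketch with explicit retry and error handling",
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            "Resource": FN + "charge-payment",
            # Retry transient task failures with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing is routed to a compensation step.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "ShipOrder",
        },
        "ShipOrder": {"Type": "Task", "Resource": FN + "ship-order", "End": True},
        "NotifyFailure": {"Type": "Task", "Resource": FN + "notify", "End": True},
    },
}


def dangling_targets(defn: dict) -> list:
    """Return transition targets that are not defined as states."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


definition_json = json.dumps(definition)
```

In a choreographed design the same retry and failure logic would be spread across each consumer's own configuration, which is the visibility trade-off described above.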

Event Sourcing and CQRS

Event Sourcing stores every state change as an immutable event in an append-only log (e.g., Kinesis or DynamoDB Streams). The current state is derived by replaying events. CQRS (Command Query Responsibility Segregation) separates the write model (commands that produce events) from the read model (optimized query projections updated by consuming events).

✓

When you need a full audit trail of all state changes, the ability to replay history to rebuild state, or when read and write workload patterns differ significantly. Common in financial systems, compliance-heavy applications, and complex domain models.

⚠

Increased complexity — the system must handle eventual consistency between write and read models. Replay can be expensive for very large event histories. Requires careful schema evolution strategy as event formats change over time.
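A minimal in-memory Python sketch of the idea, assuming a toy bank-account domain: the list stands in for the append-only log, and `replay` rebuilds a read-side projection from it. The event kinds and account IDs are invented for illustration; a real system would persist the log in a durable store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable state change appended by the write side."""
    account_id: str
    kind: str  # "Deposited" or "Withdrawn"
    amount: int


def replay(events) -> dict:
    """Derive the current balances (the read model) by folding over the log."""
    balances = {}
    for event in events:
        delta = event.amount if event.kind == "Deposited" else -event.amount
        balances[event.account_id] = balances.get(event.account_id, 0) + delta
    return balances


# Append-only log: events are added, never updated or deleted.
log = [
    Event("acct-1", "Deposited", 100),
    Event("acct-1", "Withdrawn", 30),
    Event("acct-2", "Deposited", 50),
]

read_model = replay(log)  # {"acct-1": 70, "acct-2": 50}
```

Replaying a prefix of the log (for example `replay(log[:1])`) reconstructs the state at an earlier point in time, which is where the audit and rebuild capabilities come from.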

Dead Letter Queue (DLQ) Pattern

When a consumer fails to process a message after a configured number of retries, the message is automatically moved to a Dead Letter Queue. For an SQS source queue the DLQ is another SQS queue; for Lambda asynchronous invocations it can be an SQS queue or an SNS topic. This prevents poison-pill messages from blocking the main queue and enables later inspection, alerting, and reprocessing.

✓

Always configure DLQs for production SQS queues and Lambda asynchronous invocations. Use CloudWatch alarms on DLQ depth to detect processing failures early. Essential for any EDA system where message loss is unacceptable.

⚠

DLQs require operational processes to monitor, investigate, and replay failed messages. For Lambda, the DLQ on the Lambda function only applies to async invocations — for SQS-triggered Lambda, configure the DLQ on the SQS queue itself, not the Lambda function.
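For the SQS case, the DLQ is attached through the source queue's `RedrivePolicy` attribute. The Python sketch below only builds that attribute; the ARN is hypothetical and `redrive_attributes` is an illustrative helper whose result you would pass as `Attributes` to boto3's `set_queue_attributes` or `create_queue`.

```python
import json

# Hypothetical DLQ ARN used only for illustration.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build queue attributes that move a message to the DLQ
    after it has been received max_receive_count times without being deleted."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string inside the attributes map.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


attributes = redrive_attributes(DLQ_ARN, max_receive_count=3)
```

Note that this configuration lives on the source queue, which is why an SQS-triggered Lambda function does not need (and cannot use) its own function-level DLQ for these messages.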

Decision Framework

STEP 1: Do you need ORDERING?

• YES: Use SQS FIFO (small scale, exactly-once) or Kinesis Data Streams (high throughput, per-shard ordering).
• NO: Use SQS Standard or SNS.

STEP 2: Do you need MULTIPLE INDEPENDENT CONSUMERS reading the SAME event?

• YES: Use SNS (push fan-out) or Kinesis (pull, replayable, multiple consumer groups) or EventBridge (content-based routing to many targets).
• NO: Use SQS (single logical consumer group).

STEP 3: Do you need CONTENT-BASED ROUTING (route based on event payload)?

• YES: Use EventBridge rules.
• NO: SNS message filtering or SQS.

STEP 4: Do you need EVENT REPLAY / REWIND?

• YES: Use Kinesis Data Streams (configurable retention) or EventBridge Archive & Replay.
• NO: SNS or SQS (limited retention).

STEP 5: Do you need REAL-TIME HIGH-THROUGHPUT STREAMING (MB/s, analytics)?

• YES: Kinesis Data Streams or Amazon MSK (Kafka).
• NO: SQS/SNS/EventBridge.

STEP 6: Do you need SAAS / THIRD-PARTY or CROSS-ACCOUNT event integration?

• YES: EventBridge (partner event sources).
• NO: SNS/SQS internal.

STEP 7: Do you need COMPLEX WORKFLOW LOGIC with retries, branching, human approval?

• YES: Step Functions (orchestration).
• NO: Pure EDA choreography.

STEP 8: Is DURABILITY / NO MESSAGE LOSS critical?

• YES: SQS (persistent queue) + DLQ. SNS alone is NOT durable, so pair it with SQS.
• NO: SNS push is sufficient.
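The steps above can be read as a small rule set. The Python sketch below is one possible encoding, not an official decision procedure: the `Requirements` fields, the `candidates` function, and the precedence between rules are all assumptions made for illustration, and real scenarios often combine several services.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    ordering: bool = False
    high_throughput: bool = False
    fan_out: bool = False
    content_routing: bool = False
    replay: bool = False
    saas_or_cross_account: bool = False
    complex_workflow: bool = False
    durable: bool = False


def candidates(req: Requirements) -> list:
    """Return candidate services for a scenario, following the steps above."""
    picks = []
    if req.complex_workflow:
        picks.append("Step Functions")            # step 7
    if req.high_throughput:
        picks.append("Kinesis Data Streams")      # steps 1, 4 and 5
    elif req.ordering:
        picks.append("SQS FIFO")                  # step 1
    if req.content_routing or req.saas_or_cross_account:
        picks.append("EventBridge")               # steps 3 and 6
    elif req.replay and not req.high_throughput:
        picks.append("EventBridge")               # step 4: Archive & Replay
    elif req.fan_out and not req.high_throughput:
        picks.append("SNS")                       # step 2
    if req.durable and "SQS FIFO" not in picks:
        picks.append("SQS + DLQ")                 # step 8
    return picks or ["SQS Standard"]
```

For example, a scenario needing durable fan-out yields SNS plus SQS with a DLQ, while ordered high-throughput streaming yields Kinesis Data Streams.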

Exam Tips

critical: SNS Fan-Out, SQS durability, Lambda async invocation

SNS alone does NOT guarantee delivery durability. SNS retries a failed delivery according to its delivery policy, but it keeps no queue of messages; once retries are exhausted the message is discarded unless a DLQ is configured. The canonical durable fan-out pattern is SNS → SQS → Lambda. The SQS queue stores the message, and the Lambda event source mapping keeps polling and retrying until the message is processed, expires, or moves to the queue's DLQ. Add SQS between SNS and Lambda for durable, retriable fan-out.

critical: DLQ, Lambda event source mapping, SQS

For SQS-triggered Lambda, configure the Dead Letter Queue on the SQS QUEUE — NOT on the Lambda function. The Lambda DLQ only applies to asynchronous (non-polling) Lambda invocations. This is a classic trap: candidates configure the DLQ on Lambda and wonder why failed SQS messages aren't being captured.

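critical: EventBridge rules, SNS message filtering

EventBridge is the PREFERRED modern event bus for new architectures. When a question mentions routing events based on content/payload attributes, cross-account events, SaaS integrations, or scheduled rules, EventBridge is almost always the answer over SNS. SNS subscription filter policies match on message attributes by default (and on the message body when FilterPolicyScope is set to MessageBody), but they only filter what each subscriber receives; EventBridge rules offer richer pattern matching on any JSON field and route each match to its own targets.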

critical: Kinesis shards, partition key, ordering

Kinesis Data Streams guarantees ordering WITHIN a shard, not across shards. To ensure related records are processed in order, use a consistent partition key so they always land on the same shard. If a question asks about ordering across ALL records globally, Kinesis alone cannot guarantee that — you'd need a single shard (which limits throughput) or a different approach.

critical: SQS FIFO, exactly-once, Message Group ID

SQS FIFO queues provide exactly-once processing and strict ordering within a Message Group ID. Throughput is lower than Standard queues. Use FIFO when the exam scenario mentions 'no duplicate processing', 'exactly-once', or 'strict order' — especially for financial or inventory systems.
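As an illustration of how ordering and deduplication are expressed on a FIFO queue, the Python sketch below builds the arguments for a send. `MessageGroupId` and `MessageDeduplicationId` are real `send_message` parameters; the queue URL, the `fifo_message` helper, and the choice to derive the deduplication ID from a hash of the body are assumptions for this example.

```python
import hashlib
import json

# Hypothetical FIFO queue URL (FIFO queue names end in ".fifo").
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payments.fifo"


def fifo_message(queue_url: str, group_id: str, payload: dict) -> dict:
    """Build send_message arguments for a FIFO queue.

    Messages sharing a group ID are delivered in order; messages with the same
    deduplication ID sent within the deduplication window are accepted once.
    """
    body = json.dumps(payload, sort_keys=True)  # stable serialization
    dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,
        "MessageDeduplicationId": dedup_id,
    }


first = fifo_message(QUEUE_URL, "account-17", {"op": "debit", "amount": 25})
retry = fifo_message(QUEUE_URL, "account-17", {"amount": 25, "op": "debit"})
```

Using the account as the group ID keeps one account's operations strictly ordered while letting different accounts be processed in parallel; the retried send carries the same deduplication ID as the first.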

critical

SNS alone is NOT durable. For guaranteed fan-out delivery, always use SNS → SQS → Lambda. The SQS queue is the durability layer.

critical

DLQ for SQS-triggered Lambda goes on the SQS QUEUE. DLQ for async Lambda invocations (SNS, S3, EventBridge) goes on the LAMBDA FUNCTION. Never mix these up.

critical

EventBridge beats SNS when you need content-based routing on JSON payload fields, cross-account events, SaaS integrations, or event replay. SNS beats EventBridge for simple, ultra-high-volume topic fan-out.

important: EventBridge Archive, event replay

EventBridge Archive & Replay allows you to replay past events to a bus, which is useful for debugging, reprocessing after a consumer bug fix, or testing new consumers against historical data. This is a differentiator from standard SNS topics and SQS queues, which do not support replay of already-delivered messages.

important: Lambda concurrency, SQS batch processing, visibility timeout

Lambda's SQS event source mapping automatically scales the number of concurrent Lambda executions based on queue depth, up to the configured maximum concurrency. Each batch of messages from SQS is processed by a single Lambda invocation. By default, if the function fails, the entire batch becomes visible again (not just the failed messages); enabling ReportBatchItemFailures on the event source mapping lets the function return only the failed message IDs so the rest are deleted. Configure batch size and visibility timeout carefully.
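A sketch of a Python handler for an SQS-triggered function that reports failures per message rather than failing the whole batch. It assumes the event source mapping has `ReportBatchItemFailures` enabled; the `process` function and its `order_id` check are hypothetical business logic.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic: reject messages without an order ID."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    """Process each SQS record and report only the ones that failed."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


sample_event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"order_id": "o-1"})},
        {"messageId": "m-2", "body": json.dumps({"note": "no order id"})},
    ]
}
result = handler(sample_event, None)
```

With this response shape, message `m-1` is deleted from the queue and only `m-2` becomes visible again, eventually reaching the queue's DLQ if it keeps failing.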

important: Step Functions Standard vs Express

Step Functions Standard Workflows are for long-running, durable workflows (up to 1 year). Express Workflows are for high-volume, short-duration workloads (up to 5 minutes) and are significantly cheaper per execution. For exam questions about orchestrating millions of short IoT events, Express Workflows is the right answer.

important: Kinesis Enhanced Fan-Out, consumer throughput

Kinesis Enhanced Fan-Out allows each registered consumer to receive data at 2 MB/s per shard independently via a push model (HTTP/2). Without Enhanced Fan-Out, all consumers SHARE the 2 MB/s read throughput per shard. Use Enhanced Fan-Out when multiple consumers need full throughput simultaneously.
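The arithmetic behind that comparison, as a short Python sketch. It uses the 2 MB/s per-shard read figure stated above and assumes consumers split shared throughput evenly, which is an idealization; `per_consumer_read_mb_per_s` is a hypothetical helper.

```python
SHARD_READ_MB_PER_S = 2.0  # per-shard read throughput cited above


def per_consumer_read_mb_per_s(shards: int, consumers: int,
                               enhanced_fan_out: bool) -> float:
    """Estimate read throughput available to each consumer across the stream."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must be positive")
    stream_total = shards * SHARD_READ_MB_PER_S
    if enhanced_fan_out:
        # Each registered consumer gets its own dedicated pipe per shard.
        return stream_total
    # Shared mode: all consumers compete for the same per-shard limit.
    return stream_total / consumers


shared = per_consumer_read_mb_per_s(shards=4, consumers=4, enhanced_fan_out=False)
dedicated = per_consumer_read_mb_per_s(shards=4, consumers=4, enhanced_fan_out=True)
```

With four shards and four consumers, each consumer gets about 2 MB/s in shared mode versus 8 MB/s with Enhanced Fan-Out.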

Good to Know: DynamoDB Streams, Lambda, CDC

DynamoDB Streams can trigger Lambda to process item-level changes in near real-time — this is a powerful EDA pattern for change data capture (CDC). Records in DynamoDB Streams are available for 24 hours. This is commonly tested in scenarios requiring 'react to database changes without polling'.
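A sketch of what such a change-data-capture handler might look like in Python. The record fields (`eventName`, `dynamodb.Keys`, `dynamodb.NewImage`) follow the DynamoDB Streams event format; the table layout, the `pk` key name, and the summary the handler returns are assumptions for illustration.

```python
def handler(event, context):
    """Summarize item-level changes delivered by a DynamoDB stream."""
    changes = []
    for record in event["Records"]:
        change_type = record["eventName"]  # INSERT, MODIFY, or REMOVE
        keys = record["dynamodb"]["Keys"]
        # NewImage is absent for REMOVE events and for some stream view types.
        new_image = record["dynamodb"].get("NewImage")
        changes.append(
            {
                "type": change_type,
                "pk": keys["pk"]["S"],  # hypothetical string partition key
                "has_new_image": new_image is not None,
            }
        )
    return changes


sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"pk": {"S": "order#1"}},
                "NewImage": {"pk": {"S": "order#1"}, "status": {"S": "PAID"}},
            },
        },
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"pk": {"S": "order#2"}}}},
    ]
}
summary = handler(sample_event, None)
```

Attribute values arrive in DynamoDB's typed JSON form (for example `{"S": "PAID"}`), so real handlers usually deserialize them before applying business logic.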

Common Misconceptions & Traps

Common Mistake

SNS is a reliable, durable message store — if I publish to SNS, the message will eventually be delivered no matter what.

Correct

SNS is a push-based notification service with no persistent message storage. If a subscriber endpoint is unavailable when SNS attempts delivery, the message is retried a limited number of times and then dropped (unless a DLQ is configured on the SNS subscription). SNS is NOT a queue. For guaranteed delivery, always pair SNS with SQS.

This misconception causes architects to design systems with silent message loss. On exams, any scenario requiring guaranteed delivery + fan-out should trigger 'SNS → SQS → Lambda', not 'SNS → Lambda' alone.

Common Mistake

I can use SQS FIFO for any high-throughput use case that needs ordering — it scales just like Standard SQS.

Correct

SQS FIFO queues have lower throughput limits than Standard queues. While AWS has increased FIFO throughput over time, it is not equivalent to Standard queues for very high-volume workloads. For high-throughput ordered streaming, Kinesis Data Streams is the appropriate service. FIFO is best for transactional, lower-volume ordered processing.

Candidates default to FIFO for all ordering requirements. Exams specifically test whether you know when Kinesis is the better fit. The keyword triggers are: 'high throughput', 'streaming', 'real-time analytics', 'multiple consumers' → Kinesis. 'Exactly-once', 'transactional', 'moderate volume' → SQS FIFO.

Common Mistake

EventBridge and SNS do the same thing — they both publish events to multiple subscribers, so I can use them interchangeably.

Correct

SNS and EventBridge serve overlapping but distinct purposes. SNS is optimized for simple, high-volume pub/sub with topic-based routing and subscription filter policies on message attributes (or, optionally, the message body). EventBridge supports rich content-based routing on any JSON field in the event body, schema registry, event replay, cross-account buses, SaaS partner integrations, and scheduled rules. EventBridge is the strategic, modern choice for complex EDA; SNS is simpler and cheaper for basic fan-out at scale.

Choosing between SNS and EventBridge is a frequent exam decision point. The decisive factor: if the scenario requires routing based on event payload content, cross-account delivery, SaaS sources, or event replay → EventBridge. If it's simple topic-based fan-out to many subscribers at very high volume → SNS may be more appropriate and cost-effective.

Common Mistake

Configuring a DLQ on my Lambda function will capture messages that fail during SQS-triggered processing.

Correct

The Lambda function-level DLQ only captures failures from asynchronous Lambda invocations (e.g., S3 event notifications, SNS triggers, EventBridge rules invoking Lambda directly). When Lambda is triggered by SQS via an event source mapping, the DLQ must be configured on the SQS QUEUE itself — not on the Lambda function. Failed batches are returned to the SQS queue and retried until they expire or are moved to the queue's DLQ.

This is one of the most commonly missed operational details on Developer and SysOps exams. Remember: SQS-triggered Lambda → DLQ on SQS. Async Lambda (SNS, S3, EventBridge) → DLQ on Lambda.

Common Mistake

In a choreography-based EDA system, I don't need to worry about duplicate event processing because each event is unique.

Correct

EventBridge, SNS, and SQS Standard all provide at-least-once delivery semantics — meaning a consumer may receive the same event more than once under failure conditions. All EDA consumers must be designed to be IDEMPOTENT (processing the same event multiple times produces the same result). Only SQS FIFO provides exactly-once processing within its deduplication window.

Idempotency is a foundational EDA concept tested across all certification levels. When an exam question asks how to handle duplicate events, the answer is always: design idempotent consumers (use idempotency keys, conditional writes in DynamoDB, etc.) — not 'configure the service to never send duplicates'.
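A minimal in-memory Python sketch of an idempotent consumer. The set of seen IDs stands in for a durable store such as a DynamoDB table written with a conditional put; the class name, event IDs, and the running total are hypothetical.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by its event ID."""

    def __init__(self):
        self._seen = set()  # stand-in for a durable idempotency store
        self.total = 0

    def handle(self, event_id: str, amount: int) -> bool:
        """Apply the event; return False if it was a duplicate delivery."""
        if event_id in self._seen:
            return False  # already processed: safe to acknowledge and skip
        self._seen.add(event_id)
        self.total += amount
        return True


consumer = IdempotentConsumer()
applied_first = consumer.handle("evt-1", 40)
applied_again = consumer.handle("evt-1", 40)  # redelivery of the same event
applied_other = consumer.handle("evt-2", 10)
```

After the duplicate delivery the total is still 50, not 90. In a real consumer the "check and record" step must be atomic in the backing store, which is what a conditional write provides.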

Common Mistake

Step Functions is only for long-running batch workflows — it's too slow and expensive for real-time event processing.

Correct

Step Functions Express Workflows are designed for high-volume, short-duration (up to 5 minutes) event-driven workloads. They are priced by number of requests and execution duration rather than per state transition (the Standard Workflow model), which makes them cost-effective for real-time orchestration of millions of events. They integrate natively with EventBridge, SQS, Lambda, and other services. Express Workflows are the right orchestration tool for IoT event processing, streaming pipelines, and API-driven microservice coordination.

Candidates dismiss Step Functions for real-time scenarios. Exams specifically test the Standard vs. Express distinction. Duration and volume are the decision variables: long-running + audit trail → Standard. High-volume + short-duration → Express.

Memory Tricks

🧠

SEEK ORDER: S=SQS (queue/buffer), E=EventBridge (content routing), E=Event Sourcing (audit/replay), K=Kinesis (streaming/ordered), O=Orchestration via Step Functions, R=Replay with EventBridge Archive, D=DLQ for durability, E=Eventually consistent (design idempotent consumers), R=Route with SNS fan-out.

🧠

DLQ placement rule: 'SQS owns its own DLQ' — if SQS triggers Lambda, the DLQ lives on the SQS queue. If SNS/S3/EventBridge triggers Lambda async, the DLQ lives on Lambda.

🧠

Fan-out durability: 'SNS needs a Sidekick (SQS)' — SNS alone drops messages if consumers are down. SNS + SQS = durable fan-out.

🧠

Kinesis vs SQS ordering: 'Kinesis Keeps order per shard (Key it right with partition key). SQS FIFO keeps order per Message Group ID.'

Common Trap

Configuring a DLQ on the Lambda function to capture failed SQS-triggered messages — this does NOT work. The DLQ for SQS-triggered Lambda must be on the SQS queue itself, not the Lambda function. Lambda's DLQ only applies to asynchronous (non-polling) invocations.

CertAI Tutor · 2026-02-22
