
Master the design patterns, trade-offs, and exam traps that separate cloud architects from cloud practitioners
Serverless architecture patterns describe how to compose AWS managed services — Lambda, API Gateway, SQS, SNS, EventBridge, Step Functions, DynamoDB, S3, and more — into resilient, scalable systems without managing infrastructure. Understanding these patterns is critical for the Solutions Architect Associate/Professional, Developer Associate, and DevOps Engineer exams because scenario questions routinely require you to select the correct pattern based on latency, cost, coupling, throughput, and failure-handling requirements. Mastery means knowing not just what each service does, but how and why to combine them.
Certification exams present complex scenario questions (e.g., 'millions of events per day, exactly-once processing, downstream system is slow') and require you to select the correct architectural pattern — knowing the trade-offs between synchronous vs. asynchronous, event-driven vs. request-driven, and choreography vs. orchestration is what separates passing scores from failing ones.
Synchronous Request-Response (API Gateway + Lambda)
A client sends an HTTP/REST or WebSocket request to API Gateway, which synchronously invokes a Lambda function and returns the response in the same connection. The caller waits for the result. This is the most common entry point for web and mobile backends.
User-facing APIs requiring immediate responses (login, product lookup, form submission), mobile/web backends, and microservice endpoints where latency matters and the caller needs a result.
Tightly couples caller to execution time. If Lambda takes too long, the client times out. API Gateway has integration timeout limits. Not suitable for long-running workloads. Cold starts can affect perceived latency for infrequent traffic.
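As a concrete reference point, a minimal proxy-integration handler might look like the sketch below. The event shape follows API Gateway's Lambda proxy integration; the greeting logic and parameter name are illustrative assumptions.

```python
import json

def lambda_handler(event, context):
    """Minimal API Gateway (Lambda proxy integration) handler sketch.

    API Gateway passes the HTTP request as `event`; the returned dict
    must include statusCode and a string body, or the client sees a 502.
    """
    # Query string params may be None when absent, hence the `or {}`.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

Because the caller blocks on this return value, everything inside the handler counts against the API Gateway integration timeout.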
Asynchronous Event-Driven (S3 / SNS / EventBridge → Lambda)
An event source (S3 object upload, SNS publish, EventBridge rule match) asynchronously invokes Lambda. The producer does not wait for the consumer to finish. Lambda internally queues the event and retries failures (twice by default); after retries are exhausted, the event can be routed to a Dead Letter Queue or an on-failure destination.
File processing (image resize on S3 upload), fan-out notifications, decoupled microservices, scheduled jobs (EventBridge Scheduler), and any workflow where the producer should not block on consumer completion.
No immediate response to the producer. Duplicate events are possible (at-least-once delivery), so consumers must be idempotent. Retry behavior and DLQ configuration must be explicitly designed. Ordering is not guaranteed with SNS fan-out.
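A typical consumer for this pattern is sketched below for the S3-upload case. The record layout follows the S3 event notification format; the bucket, key, and "work" performed are illustrative assumptions.

```python
def handle_s3_upload(event, context):
    """Sketch of an async Lambda triggered by S3 object-created events.

    Processing must be idempotent: async sources deliver at-least-once,
    so the same record can arrive more than once.
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real work (e.g. an image resize) would go here. Raising an
        # exception lets Lambda retry, and eventually route the event
        # to the configured DLQ or on-failure destination.
        processed.append(f"{bucket}/{key}")
    return processed
```

Note that the return value goes nowhere useful by default — the producer has long since moved on — unless an on-success destination is configured.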
Queue-Based Load Leveling (SQS → Lambda)
Messages are placed into an SQS queue (Standard or FIFO). Lambda polls the queue via an Event Source Mapping, processing messages in batches. SQS acts as a buffer, absorbing traffic spikes and smoothing load on downstream Lambda or other consumers. Lambda scales by adding polling threads (concurrent executions) proportional to queue depth.
Absorbing bursty workloads, protecting a downstream system with limited throughput (e.g., a legacy database), order processing pipelines, and any scenario requiring decoupling of producer and consumer throughput.
Standard SQS offers at-least-once delivery and best-effort ordering — consumers must be idempotent. FIFO SQS provides exactly-once processing and strict ordering but has lower throughput. Visibility timeout must exceed Lambda function duration or messages will reappear and be processed twice.
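When processing SQS batches, partial batch responses avoid reprocessing an entire batch because one message failed. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping; the `process` function and the "poison" failure condition are hypothetical stand-ins.

```python
def handle_sqs_batch(event, context):
    """SQS batch handler sketch using partial batch responses.

    With ReportBatchItemFailures enabled, returning the IDs of failed
    messages makes Lambda delete only the successful ones; the failures
    reappear after the visibility timeout for another attempt.
    """
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

def process(body):
    # Stand-in for real work; fails on a hypothetical poison message.
    if body == "poison":
        raise ValueError("cannot process")
```

Without this mechanism, one bad message would force the whole batch back onto the queue, multiplying duplicate processing.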
Fan-Out (SNS + SQS or EventBridge)
A single event or message is published to SNS or EventBridge, which delivers it to multiple subscribers simultaneously (SQS queues, Lambda functions, HTTP endpoints, email, SMS). Each subscriber processes the event independently. This decouples the publisher from all consumers.
Sending the same event to multiple downstream systems (e.g., order placed → inventory service, email service, analytics service all notified simultaneously), multi-tenant notification systems, and cross-service event broadcasting.
SNS is push-based; if a subscriber is unavailable, the message may be lost unless an SQS queue is added as a buffer (SNS + SQS fan-out is the canonical pattern). EventBridge offers richer filtering and cross-account/cross-region delivery. More complex to debug than point-to-point.
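One practical detail of the SNS → SQS fan-out: unless raw message delivery is enabled on the subscription, the SQS message body is a JSON SNS envelope, not the original payload. A minimal unwrapping helper, with an illustrative payload:

```python
import json

def extract_message(sqs_record):
    """Unwrap the SNS envelope from an SQS record in an SNS->SQS fan-out.

    Without raw message delivery, the SQS body is a JSON envelope whose
    'Message' field holds the string originally published to the topic
    (here assumed to itself be JSON).
    """
    envelope = json.loads(sqs_record["body"])
    return json.loads(envelope["Message"])
```

Forgetting this double-decode is a common first bug when wiring up the pattern.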
Orchestration (AWS Step Functions)
Step Functions defines a state machine (in Amazon States Language) that orchestrates multiple Lambda functions, AWS SDK integrations, and wait states into a coordinated workflow. The orchestrator tracks state, handles retries, catches errors, and manages branching logic. Express Workflows are for high-volume, short-duration flows; Standard Workflows for long-running, auditable processes.
Multi-step business processes (order fulfillment, loan approval), workflows requiring human approval steps (wait for callback), long-running jobs exceeding Lambda's maximum execution duration, complex error handling with compensating transactions (saga pattern), and anywhere you need a visual audit trail.
Adds cost per state transition (Standard) or per duration/invocation (Express). Introduces a central orchestrator which can become a bottleneck or single point of failure if not designed carefully. Requires learning Amazon States Language. Not ideal for simple, single-step async tasks.
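To make the orchestration concrete, here is a hypothetical two-step workflow in Amazon States Language, expressed as a Python dict (it would be serialized to JSON for the state machine definition). The state names and function ARNs are illustrative assumptions.

```python
# Sketch of a Standard workflow: reserve inventory, then charge payment,
# with retry/backoff on the first task and a catch-all failure state.
ORDER_WORKFLOW = {
    "Comment": "Hypothetical order fulfillment sketch",
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReserveInventory",
            # Retry transient failures with exponential backoff before
            # falling through to the failure state via Catch.
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 2, "MaxAttempts": 3,
                       "BackoffRate": 2.0}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "OrderFailed"}],
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargePayment",
            "End": True,
        },
        "OrderFailed": {"Type": "Fail", "Error": "OrderError",
                        "Cause": "Could not fulfill order"},
    },
}
```

The point of the pattern is visible here: retries, error routing, and sequencing live in the definition, not inside the Lambda code.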
Choreography (EventBridge Event Bus)
Services communicate by emitting and reacting to events on a shared event bus without a central coordinator. Each service publishes domain events (e.g., 'OrderPlaced') and subscribes to events it cares about. Services are fully decoupled — they don't know about each other, only about the event schema.
Highly decoupled microservice architectures, domain-driven design implementations, cross-account or cross-region event routing, and systems where teams own independent services that must react to each other's state changes without tight coupling.
Harder to trace end-to-end flow (use EventBridge + CloudWatch or X-Ray for observability). No central retry/compensation logic — each service must handle its own failures. Schema drift between producers and consumers can cause silent failures; use EventBridge Schema Registry.
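A producing service in a choreography emits nothing but well-formed domain events. The sketch below builds a `PutEvents` entry; in a real service the dict would be passed to `boto3.client('events').put_events(Entries=[entry])`. The bus name, source, and detail-type are illustrative assumptions.

```python
import json

def build_order_event(order_id, total):
    """Build a PutEvents entry for a hypothetical 'OrderPlaced' event.

    Consumers subscribe via EventBridge rules that filter on source,
    detail-type, or fields inside the Detail payload.
    """
    return {
        "EventBusName": "orders-bus",
        "Source": "com.example.orders",
        "DetailType": "OrderPlaced",
        # Detail must be a JSON *string*, not a dict.
        "Detail": json.dumps({"orderId": order_id, "total": total}),
    }
```

Keeping event construction in one place like this also makes it easier to keep producers honest against a schema registry.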
Strangler Fig (Incremental Serverless Migration)
Gradually replace a monolithic application by routing specific routes or features to new serverless implementations (Lambda + API Gateway) while the legacy system continues serving other routes. Over time, more routes are migrated until the monolith is retired. API Gateway acts as the facade/router.
Migrating legacy monolithic applications to serverless without a risky big-bang rewrite. Particularly useful when the team needs to demonstrate incremental value and reduce risk during modernization.
Requires maintaining two systems in parallel during migration. Routing logic in API Gateway can become complex. Data consistency between old and new systems must be carefully managed. Longer migration timelines.
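The facade's job reduces to prefix routing. In practice API Gateway resource paths (or ALB listener rules) play this role; the sketch below, with illustrative prefixes, shows the decision the router makes per request.

```python
def route(path, migrated_prefixes):
    """Strangler-fig routing sketch: send migrated routes to the new
    serverless stack, everything else to the legacy monolith.
    """
    for prefix in migrated_prefixes:
        if path.startswith(prefix):
            return "serverless"
    return "legacy"  # default: fall through to the monolith
```

Migration then amounts to growing `migrated_prefixes` one route at a time until "legacy" is never returned.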
Saga Pattern (Distributed Transaction Management)
In a microservices/serverless architecture, a saga breaks a distributed transaction into a sequence of local transactions, each publishing an event or message to trigger the next step. If a step fails, compensating transactions are executed to undo previous steps. Implemented via Step Functions (orchestration saga) or EventBridge (choreography saga).
Multi-service workflows requiring data consistency without distributed ACID transactions (e.g., booking a flight + hotel + car where each is a separate service). E-commerce order processing spanning inventory, payment, and shipping services.
Compensating transactions must be carefully designed and are not always possible (e.g., sending an email cannot be unsent). Eventual consistency means the system may be temporarily inconsistent. Debugging failed sagas requires good observability.
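The control flow of an orchestration saga can be sketched in a few lines. Here `steps` is a list of (action, compensation) callables; in a real Step Functions saga, `Catch` states would drive the compensation path instead of a Python `except`.

```python
def run_saga(steps):
    """Saga sketch: run local transactions in order; on failure, run the
    compensations of already-completed steps in reverse order.
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        # Undo previously committed local transactions, newest first.
        for compensate in reversed(done):
            compensate()
        return "compensated"
    return "committed"
```

The reverse-order unwind is the essential property: a payment is refunded before the inventory reservation that preceded it is released only if you get this ordering wrong.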
Backend for Frontend (BFF)
A dedicated API Gateway + Lambda backend is created for each frontend type (mobile app, web app, third-party API). Each BFF aggregates, transforms, and filters data from multiple downstream services, returning exactly what each frontend needs — reducing over-fetching and under-fetching.
Applications with multiple client types (iOS, Android, web) that have different data requirements. Reduces the complexity pushed to clients and optimizes network payloads for constrained devices.
Multiple BFF backends increase the number of services to maintain. Risk of duplicating business logic across BFFs — shared logic should be extracted to internal services. Adds latency if the BFF must call many downstream services synchronously.
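The aggregation step itself is simple shaping. The sketch below composes a mobile-tailored payload from two hypothetical downstream responses, dropping fields the mobile client never displays — this is the over-fetching reduction the pattern buys.

```python
def mobile_order_view(order, customer):
    """BFF aggregation sketch: return exactly what the mobile client
    needs from two downstream service responses (fields illustrative).
    """
    return {
        "orderId": order["id"],
        "status": order["status"],
        "customerName": customer["name"],
        # Internal fields (cost, email, audit data) deliberately omitted.
    }
```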
Event Sourcing + CQRS with DynamoDB Streams
All state changes are stored as immutable events in DynamoDB. DynamoDB Streams captures these change events and triggers Lambda to update read-optimized projections (query models) in separate tables or services. Commands (writes) and Queries (reads) use separate models optimized for their purpose.
Audit-heavy applications (financial transactions, compliance systems), systems requiring temporal queries ('what was the state at time T?'), and high-read/write ratio systems where read and write scaling requirements differ significantly.
Significantly increases architectural complexity. Eventual consistency between write and read models must be acceptable. Replaying events to rebuild state can be time-consuming for large event stores. Requires disciplined event schema design.
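The read-side projection is a fold over stream records. In the sketch below, `read_model` is an in-memory dict standing in for a real projection table; the attribute names are illustrative. Note that DynamoDB Streams images use DynamoDB's typed attribute format (`{"S": ...}`, `{"N": ...}`).

```python
def project(event, read_model):
    """CQRS projection sketch: fold DynamoDB Streams records into a
    read-optimized model keyed by order ID.
    """
    for record in event.get("Records", []):
        if record["eventName"] in ("INSERT", "MODIFY"):
            new = record["dynamodb"]["NewImage"]
            read_model[new["orderId"]["S"]] = {"status": new["status"]["S"]}
        elif record["eventName"] == "REMOVE":
            # REMOVE records carry only Keys, not a NewImage.
            read_model.pop(record["dynamodb"]["Keys"]["orderId"]["S"], None)
    return read_model
```

Because the projection lags the write by the stream-to-Lambda hop, reads are eventually consistent with writes — the trade-off named above.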
• STEP 1 — Does the caller need an immediate response? YES → Synchronous pattern (API Gateway + Lambda). NO → Go to Step 2.
• STEP 2 — Is there a risk of traffic spikes overwhelming a downstream system? YES → Queue-based load leveling (SQS → Lambda). NO → Go to Step 3.
• STEP 3 — Does one event need to reach MULTIPLE consumers simultaneously? YES → Fan-out (SNS + SQS or EventBridge). ONE consumer → Go to Step 4.
• STEP 4 — Is the workflow multi-step with complex error handling, retries, or human approval? YES → Orchestration (Step Functions Standard Workflow for long-running; Express Workflow for high-volume short-duration). NO → Go to Step 5.
• STEP 5 — Are services fully independent and owned by separate teams with no shared knowledge? YES → Choreography (EventBridge). Shared team/tight coordination needed → Step Functions orchestration.
• STEP 6 — Is this a distributed transaction spanning multiple services that must be consistent? YES → Saga pattern (Step Functions orchestration saga preferred for auditability).
• STEP 7 — Is ordering and exactly-once processing required? YES → SQS FIFO + Lambda (lower throughput, strict ordering). Ordering not required → SQS Standard (higher throughput, at-least-once).
• STEP 8 — Is Lambda execution duration a concern for long-running jobs? YES → Step Functions (no Lambda timeout concern for the overall workflow) or use S3 + Batch for very long compute jobs.
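For quick self-testing, the first five steps of the decision tree above can be condensed into a function — a study aid, not an exhaustive selector:

```python
def choose_pattern(needs_response, spiky_downstream, multi_consumer,
                   multi_step, independent_teams):
    """Condensed sketch of decision-tree steps 1-5 above."""
    if needs_response:
        return "API Gateway + Lambda (synchronous)"
    if spiky_downstream:
        return "SQS -> Lambda (load leveling)"
    if multi_consumer:
        return "SNS+SQS or EventBridge (fan-out)"
    if multi_step:
        return "Step Functions (orchestration)"
    if independent_teams:
        return "EventBridge (choreography)"
    return "Step Functions (orchestration)"  # default for coordinated teams
```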
SQS Visibility Timeout must ALWAYS be greater than your Lambda function's timeout — AWS recommends setting it to at least six times the function timeout when using an event source mapping, to leave headroom for batch processing and retries. If Lambda takes 5 minutes to process a message but the visibility timeout is 30 seconds, the message becomes visible again and gets processed by another Lambda instance — causing duplicate processing. This is a classic exam scenario.
Step Functions Standard Workflows are priced per state transition and support durations up to 1 year — use them for auditable, long-running business processes. Express Workflows are priced per invocation + duration and are designed for high-volume, short-duration (up to 5 minutes) event processing. Choosing the wrong type is a common exam trap.
SNS alone does NOT guarantee delivery if a subscriber is temporarily unavailable. The canonical resilient fan-out pattern is SNS → SQS (each subscriber gets its own SQS queue). This adds durability and decoupling. Exam questions about 'durable fan-out' or 'resilient notification' should trigger this pattern.
Lambda asynchronous invocations (from S3, SNS, EventBridge) automatically retry twice on failure. After retries are exhausted, the event is discarded unless you have configured a Dead Letter Queue (SQS or SNS) or an on-failure destination (SQS, SNS, Lambda, EventBridge). You must configure one explicitly — it is NOT automatic.
EventBridge is the preferred service for cross-account and cross-region event routing in modern serverless architectures. It supports content-based filtering (route only events matching specific attribute values), schema registry, and event replay — capabilities SNS does not natively offer. When the exam mentions 'event filtering' or 'cross-account events', think EventBridge.
API Gateway has two timeout considerations: the maximum integration timeout (29 seconds for REST APIs) and the overall request timeout. If your Lambda needs more than 29 seconds, the synchronous pattern breaks — you must switch to an async pattern (return 202 Accepted, use SQS/Step Functions, poll for result) or use WebSockets.
SQS Visibility Timeout must always be set GREATER than Lambda function timeout — if not, messages reappear and get processed multiple times. This is the #1 SQS+Lambda configuration trap on all certification exams.
Step Functions Standard = long-running auditable workflows (up to 1 year, priced per state transition). Step Functions Express = high-volume short workflows (up to 5 minutes, priced per duration). Wrong type selection is a guaranteed wrong answer.
For resilient fan-out, ALWAYS use SNS+SQS (not SNS alone). SNS alone loses messages if a subscriber is unavailable. Adding an SQS queue per subscriber provides durability and decoupling — this is the canonical AWS pattern.
For idempotency in serverless: Lambda may be invoked more than once for the same event (at-least-once delivery from async sources). Always design Lambda functions to be idempotent — use a unique event/request ID stored in DynamoDB as a deduplication key. The exam tests this in fraud prevention, payment, and order scenarios.
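A minimal idempotency guard can be sketched as follows. Here `seen` is an in-memory set standing in for the DynamoDB deduplication table; with DynamoDB you would use a conditional `PutItem` (`ConditionExpression='attribute_not_exists(pk)'`) so the check and the write are a single atomic operation.

```python
def process_once(event_id, handler, seen):
    """Idempotency sketch: skip events whose unique ID was already seen.

    `seen` stands in for a DynamoDB table keyed by event/request ID.
    """
    if event_id in seen:
        return "duplicate-skipped"
    result = handler()
    # Record only AFTER success, so a failed attempt can be retried.
    seen.add(event_id)
    return result
```

In the real DynamoDB version, a failed conditional write tells you atomically that another invocation already handled the event.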
DynamoDB Streams + Lambda is the correct pattern for change data capture (CDC) in serverless architectures. When data changes in DynamoDB, the stream triggers Lambda to propagate changes to Elasticsearch/OpenSearch, send notifications, or update aggregate tables. This is tested in real-time analytics and search scenarios.
Kinesis Data Streams vs SQS: Use Kinesis when you need ordered, replayable event streams with multiple independent consumers reading the same data at their own pace (e.g., analytics + archiving + ML all reading the same clickstream). Use SQS when you need simple work distribution across multiple worker instances processing each message once.
Lambda Destinations (on-success and on-failure) are the modern replacement for DLQs on async invocations. Destinations support routing to SQS, SNS, Lambda, or EventBridge and include the full event payload plus execution context. DLQs only receive the failed payload. For new architectures, prefer Destinations over DLQs.
Common Mistake
Lambda is always cheaper than EC2/containers for any workload
Correct
Lambda is cost-effective for spiky, unpredictable, or low-volume workloads. For consistently high-throughput workloads running 24/7, EC2 Reserved Instances or Fargate can be significantly cheaper. Lambda pricing is based on invocation count AND duration — a Lambda running continuously at maximum concurrency can exceed equivalent container costs.
Exam questions on cost optimization require you to recognize when serverless is NOT the right choice. If a scenario describes 'constant high-volume processing 24/7', containers or EC2 may be more cost-effective. The answer is never blindly 'use Lambda'.
Common Mistake
SQS FIFO queues should always be used instead of Standard queues for better reliability
Correct
SQS FIFO provides exactly-once processing and strict ordering but has significantly lower throughput limits than Standard queues. Standard queues provide at-least-once delivery with best-effort ordering at much higher throughput. FIFO is correct when ordering and deduplication are business requirements — not as a general upgrade.
Exam traps often present a high-throughput scenario and ask you to pick FIFO — the correct answer is Standard with idempotent consumers. Choosing FIFO when throughput is the primary requirement is a design mistake.
Common Mistake
Step Functions replaces Lambda — you use one or the other
Correct
Step Functions ORCHESTRATES Lambda functions (and other AWS services). They are complementary: Lambda executes compute logic, Step Functions manages the workflow, state, retries, and branching between multiple Lambda invocations. Step Functions itself does not run your code — it coordinates services that do.
Candidates sometimes think Step Functions is an alternative to Lambda for complex logic. The correct mental model is: Step Functions = workflow engine, Lambda = function execution. They work together.
Common Mistake
EventBridge and SNS are interchangeable for pub/sub messaging
Correct
SNS is optimized for high-throughput, push-based notifications to multiple subscribers (Lambda, SQS, HTTP, email, SMS) with simple topic-level filtering. EventBridge offers content-based filtering on event attributes, schema registry, event replay, cross-account/cross-region routing, and integration with 200+ AWS services and SaaS partners. EventBridge is the preferred choice for event-driven architectures; SNS is preferred for simple notification fan-out.
Exam questions distinguish between these services based on requirements. 'Filter events by specific field values' → EventBridge. 'Send SMS/email notifications' → SNS. Confusing them leads to wrong answers on architecture design questions.
Common Mistake
Serverless means you have zero operational responsibility
Correct
Serverless shifts infrastructure management to AWS, but you remain responsible for: application code correctness, IAM permissions (least privilege), error handling and retry logic, observability (CloudWatch, X-Ray tracing), cold start optimization, concurrency limits and throttling, DLQ/destination configuration, and cost management. Serverless reduces, not eliminates, operational burden.
The Shared Responsibility Model applies to serverless. Exam questions on security, compliance, and troubleshooting expect you to know what you are still responsible for in a serverless architecture.
Common Mistake
Lambda concurrency scaling is instant and unlimited
Correct
Lambda has account-level concurrency limits (soft limits that can be increased via Support). By default, Lambda can scale to a burst limit (which varies by region) and then scales at a rate of 500 additional concurrent executions per minute after the burst. Sudden massive traffic spikes can cause throttling (429 errors) until scaling catches up. Reserved concurrency can also cap a function's maximum concurrency.
Exam scenarios about handling sudden traffic spikes must account for Lambda's scaling behavior. SQS as a buffer in front of Lambda is the correct pattern to absorb spikes without throttling downstream systems.
Common Mistake
Using synchronous Lambda invocations through API Gateway is always the right pattern for microservices communication
Correct
Synchronous service-to-service calls create tight coupling, cascade failures (if one service is slow, callers time out), and make the system fragile under load. The recommended pattern for microservice communication is asynchronous messaging (SQS, SNS, EventBridge) for non-latency-sensitive operations. Synchronous calls should be reserved for truly latency-sensitive, user-facing interactions.
Architecture questions testing microservices best practices will penalize designs that chain synchronous Lambda calls. Asynchronous decoupling is the correct answer for resilience and scalability.
SAFE-BOSS: Synchronous (API GW+Lambda), Async Event-driven (S3/SNS→Lambda), Fan-out (SNS+SQS), EventBridge (choreography), BFF (Backend for Frontend), Orchestration (Step Functions), Saga (distributed transactions), Strangler Fig (migration) — the 8 core serverless patterns.
For SQS+Lambda: 'VTF' — Visibility Timeout must exceed Function Timeout, or you get duplicates.
Step Functions: 'SALE' — Standard=Auditable/Long-running/Expensive-per-transition; Express=Short-duration/High-volume/Low-cost-per-execution.
SNS alone = 'fire and forget'. SNS+SQS = 'fire and remember'. When durability matters, always add the queue.
Selecting SQS FIFO over Standard for a high-throughput scenario because it sounds 'more reliable' — FIFO provides ordering and exactly-once processing but at the cost of throughput. For high-volume workloads without strict ordering requirements, SQS Standard with idempotent Lambda consumers is the correct and more scalable answer.
CertAI Tutor · 2026-02-22