
Visually coordinate distributed applications and microservices using state machines — without managing infrastructure
AWS Step Functions is a fully managed serverless orchestration service that lets you model workflows as state machines using Amazon States Language (ASL). It coordinates AWS services, Lambda functions, and human approval steps into reliable, auditable, and scalable workflows. Step Functions takes care of error handling, retries, branching, parallel execution, and state management so your application code doesn't have to.
Replace complex, error-prone custom orchestration code with a managed, visual, auditable state machine that coordinates distributed services reliably at scale
Standard Workflows
Exactly-once execution semantics, up to 1-year duration, full execution history, charged per state transition. Best for long-running, auditable workflows.
Express Workflows
At-least-once execution semantics, up to 5-minute duration, logs to CloudWatch, charged per execution count + duration. Best for high-volume, short-duration event processing.
Synchronous Express Workflows
Waits for workflow to complete and returns result inline — ideal for API Gateway + Step Functions direct integration where caller needs immediate response.
Asynchronous Express Workflows
Fire-and-forget — caller does not wait. Logs results to CloudWatch Logs.
Wait for Task Token (Callback Pattern)
Pauses execution until an external system calls SendTaskSuccess/SendTaskFailure with the task token. Critical for human approvals and external system integrations.
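As a minimal sketch of the callback pattern, the Python below builds an ASL Task state as a plain dict. The queue URL, the `$.request` input field, the `RecordDecision` state name, and the one-day timeout are hypothetical values chosen for illustration; the `.waitForTaskToken` suffix and the `$$.Task.Token` context path are the parts that matter.

```python
import json


def build_approval_state(queue_url: str, next_state: str) -> dict:
    """Return an ASL Task state that pauses until a task-token callback arrives."""
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix selects the callback pattern.
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
        "Parameters": {
            "QueueUrl": queue_url,
            "MessageBody": {
                # $$.Task.Token reads the token from the execution context object.
                "TaskToken.$": "$$.Task.Token",
                "Request.$": "$.request",  # hypothetical input field
            },
        },
        # Assumed value: fail the state if nobody calls back within one day.
        "TimeoutSeconds": 86400,
        "Next": next_state,
    }


approval_state = build_approval_state(
    "https://sqs.example.invalid/123456789012/approvals",  # placeholder URL
    "RecordDecision",
)
print(json.dumps(approval_state, indent=2))
```

Whoever consumes the message would later pass the same token to SendTaskSuccess or SendTaskFailure to resume the execution.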
Activity Workers
Long-polling workers (on EC2, ECS, on-premises) that pull tasks from Step Functions. Enables hybrid cloud workflows.
Map State
Iterates over an array and runs states in parallel for each element. Supports MaxConcurrency to throttle parallel branches.
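A Map state with throttled concurrency might look like the following sketch, again written as a Python dict that serializes to ASL. The `$.items` path, the `ProcessItem` state name, and the Lambda ARN are placeholders, not values from any real account.

```python
import json


def build_map_state(items_path: str, max_concurrency: int, worker_arn: str) -> dict:
    """Return an inline Map state that runs one Task per array element."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,             # where the input array lives
        "MaxConcurrency": max_concurrency,   # 1 = sequential, 0 = no explicit cap
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {"Type": "Task", "Resource": worker_arn, "End": True}
            },
        },
        "End": True,
    }


map_state = build_map_state(
    "$.items",
    5,
    "arn:aws:lambda:us-east-1:123456789012:function:process-item",  # placeholder
)
print(json.dumps(map_state, indent=2))
```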
Parallel State
Runs multiple independent branches simultaneously. All branches must complete before proceeding.
Choice State
Implements conditional branching (if/else, switch) based on input data using comparison operators.
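To make the branching rule concrete, here is a small Choice state plus a toy evaluator. The state names and input fields are hypothetical, and `pick_next` is only an illustration of "first matching rule wins, otherwise Default": it understands two comparison operators and top-level `$.field` paths, nothing more.

```python
choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "ManualReview"},
        {"Variable": "$.status", "StringEquals": "REJECTED", "Next": "NotifyCustomer"},
    ],
    "Default": "AutoApprove",
}


def pick_next(state: dict, data: dict) -> str:
    """Toy evaluator: return the Next of the first rule that matches the input."""
    for rule in state["Choices"]:
        value = data.get(rule["Variable"][2:])  # strip the leading "$."
        if "NumericGreaterThan" in rule:
            if isinstance(value, (int, float)) and value > rule["NumericGreaterThan"]:
                return rule["Next"]
        if "StringEquals" in rule and value == rule["StringEquals"]:
            return rule["Next"]
    return state["Default"]


print(pick_next(choice_state, {"amount": 5000, "status": "OK"}))
print(pick_next(choice_state, {"amount": 10, "status": "OK"}))
```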
Wait State
Pauses execution for a fixed duration or until a specific timestamp. Useful for scheduling future actions within a workflow.
SDK Integrations (Optimized)
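Purpose-built integrations with commonly used AWS services, no Lambda required: call DynamoDB, SQS, SNS, ECS, Glue, Athena, Bedrock, and others directly from the state machine, with the .sync and .waitForTaskToken patterns where the service supports them.
AWS SDK Integrations (Generic)
Call nearly any AWS SDK API action from a Task state using the aws-sdk integration pattern. This is the integration type that spans 200+ AWS services and thousands of API actions.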
Error Handling and Retries
Built-in Catch and Retry blocks with exponential backoff, jitter, and max attempts. Eliminates custom retry logic in application code.
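A sketch of what these blocks look like on a Task state, expressed as a Python dict that serializes to ASL. The function name `charge-card` and the state names are hypothetical, and the retry numbers are arbitrary illustrative values.

```python
import json

charge_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "charge-card", "Payload.$": "$"},  # hypothetical function
    "Retry": [
        {
            # Retry only transient Lambda errors.
            "ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
            "IntervalSeconds": 2,   # delay before the first retry
            "MaxAttempts": 3,       # number of retries, not counting the first try
            "BackoffRate": 2.0,     # multiplier applied to the delay each retry
            "JitterStrategy": "FULL",
        }
    ],
    "Catch": [
        {
            # Anything still failing after the retries lands here.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "HandleFailure",
        }
    ],
    "Next": "Done",
}

print(json.dumps(charge_state, indent=2))
```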
Encryption at Rest
State machine definitions and execution history encrypted using AWS KMS (customer-managed or AWS-managed keys).
IAM Integration
Each state machine has an IAM execution role. Least-privilege roles per state machine are a security best practice.
EventBridge Integration
Step Functions can be triggered by EventBridge rules and can emit events to EventBridge upon completion/failure.
X-Ray Tracing
End-to-end distributed tracing across state machine executions and downstream service calls.
CloudWatch Metrics and Alarms
Monitor execution counts, failures, throttles, and duration. Set alarms for failed executions.
Workflow Studio (Visual Designer)
Drag-and-drop visual editor in the AWS Console — generates ASL automatically. Useful for rapid prototyping.
Versioning and Aliases
Publish state machine versions and create aliases for blue/green deployment patterns. Similar to Lambda versioning.
Sub-workflows (Nested State Machines)
A state machine can start another state machine execution — enables modular, reusable workflow composition.
JSONPath and JSONata
Use JSONPath expressions to filter, transform, and pass data between states. JSONata supported for more powerful transformations.
Lambda Orchestration
High frequency: Step Functions invokes Lambda functions as task states. Lambda handles compute; Step Functions handles flow control, retries, and error handling. Eliminates 'Lambda calling Lambda' anti-pattern which creates tight coupling and cascading timeouts.
Event-Driven Workflow Trigger
High frequency: EventBridge rules trigger Step Functions executions in response to AWS service events, scheduled events, or custom application events. Step Functions can also emit events to EventBridge on completion/failure for downstream orchestration.
Decoupled Task Processing
High frequency: Step Functions sends messages to SQS queues as task states, enabling decoupled processing with downstream consumers. Alternatively, SQS triggers Lambda which starts Step Functions executions for batch processing workflows.
Workflow Notification and Fan-out
High frequency: Step Functions publishes to SNS topics to notify multiple subscribers of workflow state changes (e.g., order processed, approval needed). SNS handles the fan-out; Step Functions handles the orchestration logic.
Synchronous API-Backed Workflow
High frequency: API Gateway directly integrates with Step Functions (no Lambda needed) to start Synchronous Express Workflow executions and return results to the API caller. Eliminates Lambda middleman for simple orchestration APIs.
Automated Compliance Remediation
Medium frequency: AWS Config detects non-compliant resources and triggers EventBridge events, which start Step Functions workflows to orchestrate multi-step remediation actions (notify, remediate, verify, escalate if failed).
Direct SDK Integration for State Persistence
Medium frequency: Step Functions uses optimized SDK integration to read/write DynamoDB directly (no Lambda) for workflow state tracking, idempotency tokens, and audit logging. Reduces cost and complexity.
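A direct DynamoDB write could be sketched like this. The table name `WorkflowAudit`, the attribute names, and the `NextStep` state are hypothetical; the condition expression is one assumed way to use the execution ID as an idempotency guard.

```python
import json

audit_state = {
    "Type": "Task",
    # Optimized integration: Step Functions calls DynamoDB PutItem itself.
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "WorkflowAudit",  # hypothetical table
        "Item": {
            # The execution ID from the context object doubles as an idempotency key.
            "executionId": {"S.$": "$$.Execution.Id"},
            "status": {"S": "STARTED"},
        },
        # Assumed guard: reject the write if this execution was already recorded.
        "ConditionExpression": "attribute_not_exists(executionId)",
    },
    # None serializes to JSON null: discard the API result, pass the input through.
    "ResultPath": None,
    "Next": "NextStep",
}

print(json.dumps(audit_state, indent=2))
```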
Serverless ETL Orchestration
Medium frequency: Step Functions orchestrates Glue jobs, Athena queries, and Lambda transforms in sequence or parallel. Handles job polling (using .sync integration pattern), retries on failure, and conditional branching based on data quality checks.
Container Task Orchestration
Medium frequency: Step Functions runs ECS/Fargate tasks directly using optimized integrations (.sync pattern waits for task completion). Enables long-running container workloads (>15 min) that Lambda cannot support.
Operational Monitoring and Alerting
Medium frequency: CloudWatch captures Step Functions metrics (ExecutionsFailed, ExecutionThrottled, ExecutionTime). Alarms trigger SNS notifications or Lambda remediation. Express Workflows send execution events to CloudWatch Logs for debugging when logging is enabled on the state machine.
Nested / Child Workflow Execution
Medium frequency: A parent state machine starts child state machine executions as task states. Enables modular workflow composition, code reuse, and separation of concerns. Child workflows can run synchronously or asynchronously.
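The parent side of a nested workflow can be sketched as a Task state that starts the child and waits for it. The child ARN, the `$.order` field, and the `Ship` state name are placeholders.

```python
import json


def build_child_workflow_state(child_arn: str, next_state: str) -> dict:
    """Return a Task state that starts a child execution and waits for its result."""
    return {
        "Type": "Task",
        # .sync:2 waits for the child and returns its output as JSON.
        "Resource": "arn:aws:states:::states:startExecution.sync:2",
        "Parameters": {
            "StateMachineArn": child_arn,
            "Input": {
                "order.$": "$.order",  # hypothetical input field
                # Links the child execution back to its parent in the console.
                "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id",
            },
        },
        "Next": next_state,
    }


child_state = build_child_workflow_state(
    "arn:aws:states:us-east-1:123456789012:stateMachine:FulfilOrder",  # placeholder
    "Ship",
)
print(json.dumps(child_state, indent=2))
```

Dropping the suffix from the resource ARN turns this into an asynchronous start: the parent moves on as soon as the child execution has been launched.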
Generative AI Workflow Orchestration
Medium frequency: Step Functions orchestrates multi-step GenAI workflows: retrieve context (RAG), invoke Bedrock model, evaluate response, route based on confidence score, store results. Handles retries on model throttling automatically.
Standard vs Express is the #1 decision point: Standard = exactly-once, 1-year max, charged per transition, full history in console. Express = at-least-once, 5-min max, charged per execution+duration, logs to CloudWatch. Match workflow type to requirements BEFORE answering.
The 'Lambda calling Lambda' anti-pattern: Never chain Lambda functions directly (tight coupling, cascading timeouts, no error handling). The correct answer is always Step Functions for multi-step Lambda orchestration. If you see 'coordinate multiple Lambda functions', Step Functions is the answer.
For workflows requiring tasks >15 minutes: Lambda cannot be the compute layer (15-min hard limit). Use Step Functions with ECS Fargate tasks, Activity Workers, or the Wait for Task Token callback pattern to handle arbitrarily long external operations.
Wait for Task Token (callback pattern) is the answer for: human approval workflows, waiting for external system responses, and any scenario where Step Functions must pause until an external event occurs, for as long as the state's timeout and the 1-year Standard Workflow limit allow. The task token is passed to the external system, which calls SendTaskSuccess/SendTaskFailure to resume.
Large payload anti-pattern: State machine input/output is limited to 256 KB. For large datasets (S3 files, database results), NEVER pass data directly through states. Store in S3 and pass only the S3 reference. This pattern appears in almost every ETL/data processing question.
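The store-in-S3-and-pass-a-reference idea can be sketched as a small helper. Everything here is illustrative: `to_state_input` and `fake_store` are hypothetical names, and the in-memory store stands in for a real upload (for example a wrapper around an S3 put call), which is assumed rather than shown.

```python
import json

MAX_STATE_PAYLOAD_BYTES = 256 * 1024  # 262,144 bytes


def to_state_input(payload, store, bucket: str, key: str) -> dict:
    """Pass small payloads inline; offload large ones and pass only a reference."""
    inline = {"inline": payload}
    if len(json.dumps(inline).encode("utf-8")) <= MAX_STATE_PAYLOAD_BYTES:
        return inline
    # Too large for state data: persist it and hand the workflow a pointer.
    store(bucket, key, json.dumps(payload).encode("utf-8"))
    return {"s3": {"bucket": bucket, "key": key}}


saved = {}


def fake_store(bucket: str, key: str, body: bytes) -> None:
    """In-memory stand-in for an object store, used only for this demo."""
    saved[(bucket, key)] = body


small_input = to_state_input({"id": 1}, fake_store, "demo-bucket", "small.json")
large_input = to_state_input({"rows": ["x" * 1024] * 300}, fake_store, "demo-bucket", "large.json")
print(small_input)
print(large_input)
```

Downstream states then read the object themselves using the bucket and key, so the state data stays small no matter how large the dataset grows.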
Standard vs Express: Standard = exactly-once + 1 year + per-transition pricing + console history. Express = at-least-once + 5 min + per-execution/duration pricing + CloudWatch Logs only. This distinction drives a large share of Step Functions exam questions.
Lambda's 15-minute timeout is ABSOLUTE and UNCHANGEABLE. For tasks exceeding 15 minutes, the correct architecture is Step Functions + ECS Fargate tasks OR Step Functions + Activity Workers OR the Wait for Task Token callback pattern.
256 KB payload limit: Any exam scenario involving large datasets (S3 files, query results, ML model outputs) passing through Step Functions states requires storing data in S3 and passing only the S3 key/ARN as state data.
Service integration patterns (.sync, .waitForTaskToken) are tested: .sync waits for job completion (Glue, Athena, ECS). .waitForTaskToken pauses until callback. Without a suffix, the integration is request-response: Step Functions continues as soon as the API call returns and does not wait for the underlying job to finish (effectively fire and forget). Know all three patterns.
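The three patterns differ only in the suffix on the Task state's Resource ARN, which the helper below makes explicit. `task_resource` is a hypothetical convenience function, and not every service action supports every suffix, so treat the output as a string to check against the service's documentation.

```python
PATTERN_SUFFIX = {
    "request_response": "",           # continue once the API call returns
    "sync": ".sync",                  # wait for the job to finish
    "callback": ".waitForTaskToken",  # wait for SendTaskSuccess/SendTaskFailure
}


def task_resource(service: str, action: str, pattern: str = "request_response") -> str:
    """Build a service-integration Resource ARN for a Task state."""
    return f"arn:aws:states:::{service}:{action}{PATTERN_SUFFIX[pattern]}"


print(task_resource("sns", "publish"))
print(task_resource("glue", "startJobRun", "sync"))
print(task_resource("sqs", "sendMessage", "callback"))
```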
For high-volume, short-duration event processing (IoT, streaming, microservices): Express Workflows are the answer. They support very high execution start rates and are priced like Lambda (per request plus duration). Standard Workflows are for long-running, auditable business processes.
Map state vs Parallel state: Map iterates over an array (dynamic number of parallel branches based on input). Parallel runs a fixed set of defined branches simultaneously. Exam questions about 'processing each item in a list' = Map state.
Step Functions does NOT replace EventBridge for event routing or SNS for fan-out. Step Functions orchestrates WORKFLOW LOGIC. EventBridge routes events. SNS fans out to multiple subscribers. These are complementary, not competing services — but exam questions test whether you know when NOT to use Step Functions.
For cost optimization questions: Standard Workflows don't charge during Wait states or Task Token waits — only state transitions cost money. A workflow that waits 6 hours for human approval costs the same as one that waits 6 seconds (in Step Functions charges). This makes Standard Workflows very cost-effective for human-in-the-loop patterns.
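The arithmetic behind that claim is short enough to write down. The per-transition price below is an assumed illustrative figure, not a quoted rate, and `wait_seconds` is accepted only to show that it never enters the calculation.

```python
def standard_workflow_charge(
    state_transitions: int,
    wait_seconds: float,
    price_per_1000_transitions: float,
) -> float:
    """Charge for one Standard Workflow execution: transitions only, time ignored."""
    del wait_seconds  # deliberately unused: waiting does not add state transitions
    return state_transitions / 1000 * price_per_1000_transitions


ASSUMED_PRICE = 0.025  # assumption for illustration; check current regional pricing

quick_approval = standard_workflow_charge(4, 6, ASSUMED_PRICE)            # 6-second wait
slow_approval = standard_workflow_charge(4, 6 * 60 * 60, ASSUMED_PRICE)   # 6-hour wait
print(quick_approval == slow_approval)
```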
API Gateway + Step Functions direct integration (no Lambda): For synchronous Express Workflows called via REST API, API Gateway can integrate directly with Step Functions StartSyncExecution API. This eliminates a Lambda invocation, reducing cost and latency. Look for this pattern in cost-optimization and simplification questions.
Error handling hierarchy: Retry blocks execute BEFORE Catch blocks. Within Retry, errors are tried up to MaxAttempts with exponential backoff. Only after all retries are exhausted does Catch evaluate. This order matters for designing resilient workflows.
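To see what exponential backoff means for a given Retry block, the sketch below computes the nominal delay before each retry. It ignores jitter, and `retry_delays` is a hypothetical helper, not part of any AWS SDK.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate, max_delay_seconds=None):
    """Nominal wait before each retry: interval * backoff_rate ** retry_index."""
    delays = []
    for retry_index in range(max_attempts):
        delay = interval_seconds * backoff_rate ** retry_index
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)  # optional cap on any single wait
        delays.append(delay)
    return delays


print(retry_delays(2, 3, 2.0))                        # [2.0, 4.0, 8.0]
print(retry_delays(2, 3, 2.0, max_delay_seconds=5))   # [2.0, 4.0, 5]
```

With these values the task is tried four times in total (the initial attempt plus three retries) before any Catch block is consulted.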
Activity Workers enable hybrid workflows: EC2, ECS, or on-premises servers can act as Step Functions workers by long-polling for tasks using GetActivityTask. This is the pattern for integrating legacy on-premises systems into serverless workflows during migration.
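One polling iteration of an activity worker might look like the sketch below. The `sfn` argument is assumed to be a boto3 Step Functions client (or anything exposing the same three methods); `FakeSfn`, the ARN string, and the worker name are stand-ins so the example runs without AWS access.

```python
import json


def poll_once(sfn, activity_arn: str, worker_name: str, handler) -> bool:
    """Fetch one task, run the handler, and report success or failure."""
    task = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll returned without any work
    try:
        result = handler(json.loads(task["input"]))
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    except Exception as exc:
        sfn.send_task_failure(taskToken=token, error=type(exc).__name__, cause=str(exc))
    return True


class FakeSfn:
    """Minimal in-memory stand-in for the client, for demonstration only."""

    def __init__(self):
        self.calls = []

    def get_activity_task(self, activityArn, workerName):
        return {"taskToken": "tok-1", "input": json.dumps({"n": 2})}

    def send_task_success(self, taskToken, output):
        self.calls.append(("success", taskToken, output))

    def send_task_failure(self, taskToken, error, cause):
        self.calls.append(("failure", taskToken, error))


fake = FakeSfn()
handled = poll_once(fake, "arn:placeholder:activity", "worker-1", lambda d: {"double": d["n"] * 2})
print(handled, fake.calls)
```

A real worker would call `poll_once` in a loop on the EC2 instance, container, or on-premises host that does the work.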
Common Mistake
Lambda's 15-minute timeout can be extended using Provisioned Concurrency, container images, or Step Functions configuration
Correct
Lambda has a HARD 15-minute maximum execution timeout regardless of how it is invoked, packaged, or configured. Provisioned Concurrency eliminates cold starts but does NOT extend the timeout. Container images and ZIP deployments have the same 15-minute limit. Step Functions does NOT extend Lambda's timeout — it works AROUND it by using ECS tasks, Activity Workers, or Task Token callbacks for operations exceeding 15 minutes.
This is the #1 Lambda/Step Functions misconception on all certification exams. The correct architecture for >15-min workloads is Step Functions + ECS Fargate (not Lambda). Any answer claiming Lambda can run longer than 15 minutes is ALWAYS wrong.
Common Mistake
Step Functions can replace SNS for fan-out messaging to multiple subscribers
Correct
Step Functions is an orchestration service, not a messaging/fan-out service. For fan-out to multiple independent subscribers (SQS queues, Lambda functions, HTTP endpoints), SNS is the correct service. Step Functions can CALL SNS as a task state, but it cannot replace SNS's pub/sub fan-out capability. Overengineering with Step Functions when simple SNS fan-out is needed is a common and costly mistake.
Exam questions test whether candidates know service boundaries. Step Functions = workflow orchestration (sequential/conditional/parallel coordination). SNS = pub/sub fan-out. SQS = point-to-point queuing. These are complementary, not interchangeable.
Common Mistake
Express Workflows provide exactly-once execution guarantees, just like Standard Workflows
Correct
Express Workflows provide AT-LEAST-ONCE execution semantics: a step may execute more than once in failure/retry scenarios. Standard Workflows provide EXACTLY-ONCE execution semantics. For financial transactions, inventory updates, or any other operation that is not idempotent, Standard Workflows are required. Express Workflows require your tasks to be idempotent by design.
This distinction is critical for data integrity questions. If you see 'payment processing', 'database write', or 'inventory deduction' in an exam question, exactly-once semantics (Standard Workflows) is required. At-least-once (Express) is acceptable for idempotent operations like sending metrics or logging.
Common Mistake
Step Functions execution history is available for all workflow types in the AWS Console
Correct
Only Standard Workflows maintain full execution history viewable in the AWS Step Functions console (up to 25,000 events). Express Workflows do NOT store execution history in the console — they send all execution logs to Amazon CloudWatch Logs. To debug Express Workflow executions, you MUST have CloudWatch Logs enabled and query CloudWatch Logs Insights.
Operations and debugging questions test this. If you need a visual execution history for debugging, Standard Workflows are required. Express Workflows require CloudWatch Logs setup for observability — forgetting this in production leads to undebuggable failures.
Common Mistake
EventBridge should always be used to trigger Step Functions for complex, multi-step workflows because it provides better decoupling
Correct
While EventBridge CAN trigger Step Functions, it is not always necessary or appropriate. For simple, direct invocations from a known source (API Gateway, Lambda, SDK call), invoking Step Functions directly is simpler and has lower latency. EventBridge adds value when you need event routing, pattern matching across multiple sources, or decoupling from the event producer. Overengineering with EventBridge when a direct StartExecution API call suffices is an anti-pattern tested in cost and simplicity optimization questions.
The exam tests architectural judgment. EventBridge is NOT always required as a trigger — use it when its event routing, filtering, or decoupling capabilities are genuinely needed. Adding EventBridge purely for 'decoupling' when there is a single known event source adds unnecessary complexity and cost.
Common Mistake
The Map state in Step Functions processes items sequentially (one at a time)
Correct
The Map state processes array items IN PARALLEL by default. You can control parallelism with the MaxConcurrency parameter: setting it to 1 makes processing sequential, while the default of 0 sets no cap of your own, so Step Functions runs iterations as concurrently as the Map mode allows. An inline Map state runs at most 40 iterations concurrently; Distributed Map mode supports up to 10,000 parallel child workflow executions. This is a common exam trap in ETL and batch processing questions.
If an exam question asks how to process 1,000 S3 objects concurrently, a Map state in Distributed mode with MaxConcurrency left at 0 is the answer. If it asks for sequential processing to avoid DynamoDB throttling, Map state with MaxConcurrency=1 is correct. Knowing you can control this is essential.
Common Mistake
Step Functions is only useful for serverless (Lambda-based) architectures
Correct
Step Functions can orchestrate virtually any compute resource: Lambda, ECS/Fargate containers, EC2 instances (via Activity Workers), on-premises servers (via Activity Workers), and 200+ AWS services via direct SDK integrations. It is equally valuable in containerized, hybrid, and traditional architectures. The 'serverless' label refers to Step Functions itself being managed infrastructure — not the workloads it orchestrates.
Migration and modernization questions (SAP-C02, DOP-C02) often involve Step Functions orchestrating a mix of legacy on-premises systems (Activity Workers) and new cloud services. Knowing Step Functions is compute-agnostic is essential for hybrid architecture questions.
SEACH = Standard: Exactly-once, Audit history, 1 year, Charged per transition, History in console | EXPRESS = at-least-once, 5-min, Rapid, Express-way pricing (duration+count), Sends logs to CloudWatch
The 3 Integration Suffixes: .sync (wait for job) | .waitForTaskToken (wait for callback) | [none] (request-response: continue once the API call returns). Mnemonic: 'SWT: Sync Waits, Token waits, nothing fires'
256 KB Rule: 'If your data is BIGGER than a book chapter (256 KB), PUT IT IN S3 and pass the reference' — applies to ALL state machine input/output
Lambda's 15-minute WALL: 'Lambda hits the WALL at 15 minutes — Step Functions goes AROUND the wall using ECS, Activity Workers, or Task Tokens'
Map vs Parallel: 'MAP = Many items from Array Processed in parallel' | 'PARALLEL = Pre-defined branches All Run at once' — Map is dynamic, Parallel is static
CertAI Tutor · SAA-C03, SAP-C02, DEA-C01, DOP-C02, CLF-C02, DVA-C02 · 2026-02-21