
Fully managed, auto-scaling real-time data delivery to AWS storage and analytics destinations — no servers, no shards, no operations.
Amazon Kinesis Data Firehose (now branded as Amazon Data Firehose) is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It automatically scales to match the throughput of your data and requires no ongoing administration. Firehose can batch, compress, transform, and encrypt data before loading, minimizing storage costs and increasing security.
Near-real-time ETL pipeline that continuously captures, transforms, and delivers streaming data to destinations like S3, Redshift, OpenSearch, Splunk, and HTTP endpoints — without managing infrastructure.
Use When
Avoid When
Automatic scaling (no shard management)
Scales throughput automatically — fundamental differentiator vs. Kinesis Data Streams
Data transformation via AWS Lambda
Invoke Lambda inline to parse, enrich, filter, or convert records before delivery
Format conversion (JSON → Parquet / ORC)
Converts JSON to columnar formats using AWS Glue schema — reduces S3 storage cost and improves Athena query performance
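As a sketch, the conversion is configured on the delivery stream's extended S3 destination via `DataFormatConversionConfiguration`; the shape below follows the Firehose `CreateDeliveryStream` API, while every ARN, bucket, and Glue database/table name is a placeholder:

```python
# Sketch of a Firehose ExtendedS3DestinationConfiguration that converts
# incoming JSON to Parquet using a Glue Data Catalog schema.
# All ARNs and names below are placeholders, not real resources.
extended_s3_config = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "BucketARN": "arn:aws:s3:::example-data-lake",
    "DataFormatConversionConfiguration": {
        "Enabled": True,
        # Deserialize the incoming JSON records...
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # ...and serialize them as Parquet on delivery.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        # The schema comes from a Glue Data Catalog table.
        "SchemaConfiguration": {
            "DatabaseName": "analytics_db",
            "TableName": "clickstream",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        },
    },
}

# With AWS credentials configured, this would be passed to the real call:
# import boto3
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="json-to-parquet",
#     ExtendedS3DestinationConfiguration=extended_s3_config,
# )
```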
Compression (GZIP, Snappy, Hadoop-compatible Snappy, ZIP)
Applied at delivery time to reduce storage footprint
Server-side encryption (SSE)
Encrypts data at rest using an AWS KMS customer managed key (CMK) or an AWS managed key
Dynamic partitioning (S3)
Routes records to S3 prefixes based on record content (e.g., by customer ID or event type) — enables efficient data lake partitioning
Data replay / record retention
Firehose does NOT retain records after delivery — use Kinesis Data Streams if replay is required
Multiple concurrent consumers
One delivery stream = one primary destination; fan-out requires Kinesis Data Streams
Source: Kinesis Data Streams
Firehose can read from a Kinesis Data Stream as its source, enabling a Streams → Firehose → S3/Redshift pipeline
Source: Direct PUT (SDK/Agent/CloudWatch/IoT)
Producers can write directly to Firehose using PutRecord / PutRecordBatch APIs
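A minimal boto3-style sketch of batching records for `PutRecordBatch`; the stream name is a placeholder, and the network call itself is left commented because it needs AWS credentials:

```python
import json

# Build a PutRecordBatch payload: each record is a dict with a "Data"
# blob; a trailing newline keeps the delivered S3 objects line-delimited.
events = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "purchase"},
]
records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

# With credentials configured, the real call (up to 500 records per batch) is:
# import boto3
# firehose = boto3.client("firehose")
# response = firehose.put_record_batch(
#     DeliveryStreamName="example-stream",  # placeholder name
#     Records=records,
# )
# Always check response["FailedPutCount"] and retry any failed records.
print(len(records))  # 2
```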
Source: Amazon MSK / Apache Kafka
Firehose can consume from MSK topics as a managed source
Destination: Amazon S3
Primary and most common destination; supports dynamic partitioning and format conversion
Destination: Amazon Redshift
Firehose first writes to S3, then issues a Redshift COPY command — not a direct streaming insert
Destination: Amazon OpenSearch Service
Direct delivery to OpenSearch indices for log analytics
Destination: Splunk
Native integration via HTTP Event Collector (HEC)
Destination: HTTP Endpoint (custom)
Deliver to any HTTP/HTTPS endpoint — enables third-party SaaS integrations (Datadog, New Relic, Dynatrace)
Destination: Amazon OpenSearch Serverless
Serverless OpenSearch collections supported as a destination
S3 backup for all records or failed records only
Configure S3 backup to capture all records or only failed delivery records for reprocessing
CloudWatch metrics and monitoring
Emits metrics for incoming bytes, delivery success/failure, throttled records, and more
Streams → Firehose → S3 Data Lake
High frequency: Kinesis Data Streams handles real-time processing and fan-out; Firehose reads from the stream and continuously delivers batched, compressed, optionally converted data to S3. Use when you need both real-time processing AND durable storage.
Inline Data Transformation Pipeline
High frequency: Firehose invokes Lambda to transform/enrich records (e.g., parse JSON, mask PII, add metadata) before delivering to S3. Failed transformations are sent to an S3 error prefix. Classic serverless ETL pattern.
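A runnable sketch of the transformation contract Firehose expects from Lambda: each input record carries base64-encoded `data`, and each output record must return `recordId`, `result` (Ok / Dropped / ProcessingFailed), and the transformed base64 `data`. The PII-masking logic and field names here are illustrative:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode each record, mask an
    illustrative PII field, re-encode, and echo back its recordId."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["email"] = "***MASKED***"  # illustrative PII masking
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}

# Simulated Firehose invocation with one record:
event = {"records": [{
    "recordId": "rec-1",
    "data": base64.b64encode(b'{"email": "a@example.com", "id": 7}').decode(),
}]}
result = handler(event, None)
print(result["records"][0]["result"])  # Ok
```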
Log Analytics Pipeline
High frequency: CloudWatch Logs subscription filter sends log events to Firehose, which delivers them to OpenSearch for real-time log search and dashboards. Common for centralized logging architectures.
Near-Real-Time Data Warehouse Loading
High frequency: Firehose buffers streaming data to S3, then automatically issues a Redshift COPY command to load data into Redshift tables. The S3 intermediate step is mandatory — Firehose does NOT insert directly into Redshift.
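The mandatory S3 stage is visible in the API itself: `RedshiftDestinationConfiguration` embeds a required `S3Configuration` plus a `CopyCommand`, sketched below with placeholder identifiers throughout:

```python
# Sketch of a RedshiftDestinationConfiguration for CreateDeliveryStream.
# Note that S3Configuration is a REQUIRED nested field: Firehose stages
# objects there first, then runs the COPY command against the cluster.
# All ARNs, hostnames, and credentials below are placeholders.
redshift_config = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "ClusterJDBCURL": "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
    "CopyCommand": {
        "DataTableName": "events",
        "CopyOptions": "JSON 'auto' GZIP",  # must match the staged object format
    },
    "Username": "firehose_user",
    "Password": "placeholder-password",
    # Mandatory intermediate stage: there is no direct insert path.
    "S3Configuration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-redshift-staging",
    },
}

# With credentials configured:
# import boto3
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="to-redshift",
#     RedshiftDestinationConfiguration=redshift_config,
# )
```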
IoT Telemetry Ingestion
Medium frequency: IoT Core rules engine routes device telemetry directly to Firehose for batched delivery to S3. Enables cost-effective IoT data lake ingestion without custom consumers.
Serverless Analytics Pipeline
Medium frequency: Firehose converts JSON to Parquet/ORC using a Glue Data Catalog schema, partitions data dynamically in S3, and Athena queries it efficiently. Reduces query cost by up to 87% vs. raw JSON.
SIEM Integration
Medium frequency: Firehose delivers security events, VPC Flow Logs, or CloudTrail data to Splunk via HEC. Native integration handles authentication, retry, and error backup to S3.
Firehose is NOT real-time — it is 'near-real-time' due to buffering. The minimum buffer interval was historically 60 seconds; with the buffer interval set to 0, delivery is triggered by buffer size alone, bringing latency down to a few seconds. Exam questions about 'real-time' processing should point to Kinesis Data Streams + Lambda, not Firehose.
Firehose to Redshift ALWAYS goes through S3 first. Firehose writes to an S3 bucket, then issues a COPY command to Redshift. This is not a direct streaming insert. If an exam question says 'directly load to Redshift,' the answer involves this two-step S3 intermediate.
Firehose automatically scales — you never provision shards or capacity. This is the #1 architectural differentiator from Kinesis Data Streams. When a scenario requires 'no capacity management' or 'automatic scaling,' Firehose wins over Kinesis Data Streams.
Firehose has NO replay capability. Once data is delivered to the destination, it cannot be re-read from Firehose. If a scenario requires replaying data or multiple consumers reading at different offsets, the answer is Kinesis Data Streams (with its 1–365 day retention).
Firehose is NEAR-real-time (buffered), NOT real-time. For millisecond/sub-second latency, use Kinesis Data Streams + Lambda. Firehose trades latency for zero-ops simplicity.
Firehose to Redshift ALWAYS stages through S3 first — Firehose writes to S3, then issues a COPY command. There is NO direct streaming insert to Redshift.
Firehose auto-scales with NO shard management. Kinesis Data Streams requires manual shard provisioning. 'Least operational overhead for streaming delivery' = Firehose.
Lambda transformation timeout in Firehose is capped at 3 minutes per batch invocation — not Lambda's normal 15-minute maximum. Records that exceed this timeout are treated as processing failures and routed to the S3 error prefix.
Dynamic partitioning allows Firehose to route records to different S3 prefixes based on record content (e.g., customer_id, event_type). This is critical for building efficient data lake partitioning strategies without post-processing.
Format conversion (JSON → Parquet/ORC) requires an AWS Glue Data Catalog table schema. This is a serverless columnar conversion that dramatically reduces S3 storage and Athena query costs — know this pattern for cost optimization scenarios.
Firehose can receive data from Kinesis Data Streams as a source. This hybrid pattern lets you have real-time consumers (Lambda/KCL) on the stream AND durable S3 delivery via Firehose — both reading from the same stream simultaneously.
For failed deliveries, Firehose sends records to a configurable S3 error prefix — NOT back to the source stream. Always design a dead-letter / error reprocessing strategy using this S3 error bucket.
Common Mistake
Amazon Kinesis Firehose delivers data in real-time (millisecond latency) like Kinesis Data Streams.
Correct
Firehose is near-real-time due to mandatory buffering. Data is held in a buffer until either the size threshold or time interval is reached before delivery. Minimum practical latency is typically 60+ seconds for most configurations.
Exam questions frequently use 'real-time' as a distractor. If the scenario demands sub-second or millisecond processing, the answer is Kinesis Data Streams + Lambda/KCL, not Firehose. Firehose trades latency for simplicity and cost.
Common Mistake
Firehose inserts data directly into Amazon Redshift tables as records arrive.
Correct
Firehose ALWAYS stages data in Amazon S3 first, then issues a Redshift COPY command to load the data. There is no direct streaming insert path to Redshift from Firehose.
This two-step process means there is additional latency for Redshift loading, and you need appropriate S3 permissions AND Redshift COPY permissions configured. Exam questions about 'streaming directly into Redshift' are testing whether you know this mandatory S3 intermediate step.
Common Mistake
Kinesis Data Firehose and Kinesis Data Streams are interchangeable — both do the same thing.
Correct
They are fundamentally different: Kinesis Data Streams is a real-time, durable, replayable stream with manual shard capacity management and multiple consumers. Firehose is a fully managed, auto-scaling delivery service with no replay, one destination, and built-in ETL — optimized for loading data into storage/analytics services.
This is the most common Kinesis confusion on all three exams. The decision framework: Need replay? → Data Streams. Need fan-out to many consumers? → Data Streams. Need zero-ops delivery to S3/Redshift/OpenSearch with transformation? → Firehose. Need both? → Data Streams as source feeding Firehose.
Common Mistake
You need to manage capacity (like shards) for Firehose to handle traffic spikes.
Correct
Firehose is fully serverless and auto-scales automatically. You never provision, split, or merge shards. This is explicitly one of its core value propositions over Kinesis Data Streams.
Exam scenarios asking 'which service requires the least operational overhead for streaming data delivery' should always point to Firehose over Kinesis Data Streams for the delivery/loading use case.
Common Mistake
Firehose Lambda transformation can run for up to 15 minutes (the standard Lambda maximum timeout).
Correct
Firehose Lambda transformation is capped at 3 minutes per invocation. This is a Firehose-specific constraint, not a general Lambda constraint. Records that time out are treated as transformation failures.
Candidates who know Lambda's 15-minute max assume it applies everywhere Lambda is used. Firehose imposes its own stricter timeout. This matters for complex transformation logic — if processing might take longer than 3 minutes, you need a different architecture.
FIREHOSE = 'Fire and Forget, Hose it to Storage' — you send data, it buffers, transforms, and delivers automatically. No replay, no shards, no ops.
Redshift via Firehose = 'S3 is Always the Middleman' — Firehose → S3 → COPY → Redshift. Never direct.
Firehose vs. Streams decision: 'FORD' — Firehose: One destination, Real-time NOT needed, Delivery managed, no replay. Streams: Fan-out, Ordering guaranteed, Replay available, Developer-managed capacity.
Buffer triggers: 'Size or Time — First to Cross the Line' — whichever threshold (buffer size OR buffer interval) is hit first triggers delivery to the destination.
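The 'first to cross the line' rule can be modeled in a few lines; the default thresholds below are illustrative, since actual configurable limits depend on the destination:

```python
def should_flush(buffered_mb: float, elapsed_s: float,
                 size_limit_mb: float = 5.0, interval_s: float = 300.0) -> bool:
    """Firehose delivers when EITHER the buffer size OR the buffer
    interval threshold is reached, whichever comes first."""
    return buffered_mb >= size_limit_mb or elapsed_s >= interval_s

print(should_flush(5.0, 10))   # True  (size threshold hit first)
print(should_flush(0.1, 300))  # True  (interval threshold hit first)
print(should_flush(0.1, 10))   # False (neither threshold reached yet)
```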
CertAI Tutor · SAA-C03, DVA-C02, CLF-C02 · 2026-02-21