
Master the layered, service-agnostic framework that powers every AWS analytics certification question
A data lake on AWS is a centralized repository built primarily on Amazon S3 that stores structured, semi-structured, and unstructured data at any scale, enabling diverse analytics workloads without predefined schemas. Understanding data lake architecture is critical for AWS certifications because it ties together S3, Glue, Lake Formation, Athena, EMR, Redshift Spectrum, and IAM into a cohesive pattern that appears across SAA-C03, SAP-C02, DEA-C01, and DAS-C01 exams. Exam questions test your ability to select the right ingestion, cataloging, transformation, and consumption layer components for a given scenario.
Understand how each AWS service maps to a specific data lake layer (ingest → store → catalog → process → consume → govern) so you can answer scenario-based questions about which service combination solves a given business requirement most cost-effectively and securely.
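One way to drill the layer-to-service mapping is a simple lookup table. The placements below are simplified for exam study (some services legitimately span layers), and the grouping is my own sketch, not an official AWS taxonomy:

```python
# Illustrative mapping of AWS services to data lake layers.
# Simplified for exam study; several services span more than one layer.
LAYER_SERVICES = {
    "ingest":  ["Kinesis Data Streams", "Kinesis Data Firehose", "MSK", "DMS"],
    "store":   ["Amazon S3", "S3 Glacier"],
    "catalog": ["AWS Glue Data Catalog", "Glue Crawlers"],
    "process": ["Glue ETL", "EMR", "Lambda", "Glue DataBrew"],
    "consume": ["Athena", "Redshift Spectrum", "QuickSight", "OpenSearch Service"],
    "govern":  ["Lake Formation", "AWS RAM", "Amazon DataZone", "IAM"],
}

def layer_of(service: str) -> str:
    """Return the layer a service belongs to, or 'unknown' if unlisted."""
    for layer, services in LAYER_SERVICES.items():
        if service in services:
            return layer
    return "unknownown" if False else "unknown"
```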
Lambda Architecture (Batch + Speed Layer)
Combines a batch processing layer (EMR, Glue ETL, Spark) for high-throughput historical data with a real-time speed layer (Kinesis Data Streams, Kinesis Data Firehose, MSK) for low-latency streaming. Results from both layers are merged at the serving layer (Redshift, Athena, OpenSearch).
When to use: the business requires both historical trend analysis and near-real-time dashboards simultaneously — e.g., fraud detection that needs both historical patterns and live transaction scoring.
Trade-offs: operational complexity of maintaining two code paths; data duplication between batch and speed layers increases cost; eventual consistency between layers can cause temporary discrepancies.
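The serving-layer merge can be sketched in a few lines: the batch view is trusted up to its last completed run, and the speed view fills in everything newer. A toy sketch with integer timestamps and hypothetical record shapes:

```python
def merge_views(batch_view, speed_view, batch_cutoff):
    """Lambda-architecture serving-layer merge (illustrative):
    batch records are authoritative up to the last completed batch run;
    speed-layer records cover everything after the cutoff."""
    merged = [r for r in batch_view if r["ts"] <= batch_cutoff]
    merged += [r for r in speed_view if r["ts"] > batch_cutoff]
    return sorted(merged, key=lambda r: r["ts"])
```

Note the eventual-consistency trade-off is visible here: a speed-layer record at `ts == batch_cutoff` is dropped in favor of the batch result, so the two views can briefly disagree until the next batch run.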
Kappa Architecture (Streaming-Only)
Eliminates the separate batch layer by reprocessing historical data through the same streaming pipeline (Kinesis, MSK/Kafka). All data flows through a single code path, stored in S3 and optionally in a time-series store.
When to use: the streaming pipeline can handle full historical reprocessing and operational simplicity is prioritized over raw batch throughput — e.g., IoT telemetry pipelines where the event stream is the system of record.
Trade-offs: reprocessing large historical datasets through a streaming system can be expensive and slow; Kinesis Data Streams retention limits mean very old data must be replayed from S3.
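The single-code-path idea can be sketched directly: the same transformation function serves both the live stream consumer and an S3 replay (function and field names here are hypothetical):

```python
def enrich(event: dict) -> dict:
    """The one transformation every event passes through (illustrative)."""
    return {**event, "device": event["device"].upper()}

def run_pipeline(events):
    """In a Kappa architecture this same code path handles both the live
    stream (a Kinesis/MSK consumer) and historical replays read from S3."""
    return [enrich(e) for e in events]

live_output     = run_pipeline([{"device": "sensor-a", "temp": 21}])
replayed_output = run_pipeline([{"device": "sensor-a", "temp": 19}])  # replayed from S3
```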
Medallion Architecture (Bronze / Silver / Gold)
Organizes S3 into three zones: Bronze (raw ingested data, immutable), Silver (cleaned, deduplicated, conformed data after Glue or EMR transformation), and Gold (business-level aggregates optimized for consumption by Athena, Redshift Spectrum, or QuickSight). Lake Formation governs access at each zone boundary.
When to use: data quality, lineage, and governance are top priorities — common in regulated industries (finance, healthcare). This is the dominant pattern for DEA-C01 and DAS-C01 exam scenarios.
Trade-offs: storage costs multiply as data is copied across zones; requires disciplined naming conventions and Lake Formation permissions at each zone; transformation latency increases time-to-insight.
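One disciplined naming convention might look like the sketch below. The bucket names and prefix layout are illustrative conventions, not an AWS standard:

```python
def zone_path(zone: str, domain: str, table: str, dt: str) -> str:
    """Build an S3 key prefix for a medallion zone.
    Bucket names and layout are a hypothetical convention."""
    buckets = {
        "bronze": "datalake-bronze",  # raw, immutable
        "silver": "datalake-silver",  # cleaned, conformed
        "gold":   "datalake-gold",    # business-ready aggregates
    }
    if zone not in buckets:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://{buckets[zone]}/{domain}/{table}/dt={dt}/"
```

Keeping each zone in its own bucket (rather than prefixes in one bucket) makes Lake Formation grants and per-zone lifecycle policies simpler to reason about.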
Data Lakehouse (Lake + Warehouse Convergence)
Combines the flexibility of a data lake (S3 as storage) with the performance and ACID transactions of a data warehouse using open table formats (Apache Iceberg, Hudi, Delta Lake) on S3, queryable by Athena, EMR, and Redshift Spectrum. AWS Glue Data Catalog acts as the unified metastore.
When to use: you need ACID transactions, time-travel queries, schema evolution, and upserts (CDC from RDS/Aurora via DMS or Glue) on data lake storage without migrating to a full warehouse — e.g., replacing nightly full-table reloads with incremental CDC merges.
Trade-offs: open table format compaction jobs add compute cost; Iceberg/Hudi support varies by engine (verify Athena engine version 3 for full Iceberg support); adds metadata complexity to the Glue catalog.
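For reference, creating an Iceberg table from Athena (engine v3) is a single DDL statement. The database, table, columns, and S3 location below are hypothetical:

```python
# Athena (engine v3) DDL for an Iceberg table with a hidden daily partition.
# Database, table, column, and bucket names are hypothetical.
create_iceberg_ddl = """
CREATE TABLE analytics.orders (
  order_id  string,
  amount    double,
  order_ts  timestamp)
PARTITIONED BY (day(order_ts))
LOCATION 's3://my-datalake/silver/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
"""
```

Note the `day(order_ts)` partition transform: unlike Hive-style tables, Iceberg derives partitions from column values, so writers and queries never manipulate partition columns explicitly.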
Hub-and-Spoke Data Mesh
Decentralizes data ownership across business domains (spokes), each maintaining its own S3 prefix/bucket and Glue catalog, while a central governance layer (Lake Formation, AWS RAM for cross-account sharing) provides discoverability and policy enforcement. Amazon DataZone provides the data marketplace catalog layer.
When to use: the organization is large, domain teams own their data products, and a central team cannot scale to manage all pipelines — common in SAP-C02 organizational design questions.
Trade-offs: cross-account Lake Formation permissions require careful IAM and RAM configuration; data product quality is only as good as each domain team's standards; discoverability requires investment in DataZone or a custom catalog.
Serverless Analytics Lake
Fully serverless pipeline: S3 for storage, Glue Crawlers for cataloging, Glue ETL (serverless) or Glue DataBrew for transformation, Athena for ad-hoc SQL queries, and QuickSight for visualization. No servers to manage, pay-per-use pricing throughout.
When to use: workloads are intermittent, team size is small, or cost optimization for sporadic queries is paramount — ideal for startups or departmental analytics with unpredictable query patterns.
Trade-offs: Athena's per-query startup and scheduling latency makes it unsuitable for sub-second SLAs; Glue job startup time adds latency vs. persistent EMR clusters; cost can exceed EMR for very high, sustained query volumes.
• STEP 1 — What is the latency requirement?
• Sub-second / real-time → Kinesis Data Streams + Lambda or MSK (speed layer)
• Minutes (near-real-time) → Kinesis Data Firehose → S3 → Athena
• Hours/daily (batch) → AWS Glue ETL or EMR → S3
• STEP 2 — What is the query pattern?
• Ad-hoc SQL by analysts, no infrastructure → Athena (pay per query, serverless)
• Complex SQL, concurrent BI users, high performance → Redshift (provisioned or Serverless)
• Machine learning / custom Spark → EMR or Glue Spark
• Full-text search / log analytics → OpenSearch Service
• STEP 3 — Do you need ACID / CDC / upserts?
• Yes → Use Apache Iceberg or Hudi on S3 with Athena Engine v3 or EMR
• No → Standard S3 Parquet/ORC partitioned by date is sufficient
• STEP 4 — What are the governance requirements?
• Fine-grained column/row-level security → AWS Lake Formation (column-level security, row filters)
• Cross-account data sharing → Lake Formation + AWS RAM
• Data marketplace / data products → Amazon DataZone
• Basic bucket-level isolation → S3 bucket policies + IAM
• STEP 5 — What is the cost model?
• Sporadic, unpredictable → Serverless (Athena, Glue serverless, Redshift Serverless)
• Sustained, high-throughput → Provisioned EMR (Reserved Instances or Spot) or Redshift with reserved nodes
• Long-term archival raw zone → S3 Glacier Instant Retrieval or Glacier Deep Archive
• STEP 6 — Catalog needed?
• Always → AWS Glue Data Catalog (central metastore for Athena, EMR, Redshift Spectrum, Glue)
• Hive Metastore migration → Glue can replace or federate with external Hive Metastore
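Steps 1 and 2 of the decision tree above can be encoded as a toy lookup, which is a handy way to drill the mappings. The keys are my own shorthand labels, not AWS terms:

```python
def pick_services(latency: str, query_pattern: str) -> dict:
    """Toy encoding of STEP 1 (latency) and STEP 2 (query pattern)
    from the decision tree; labels are illustrative shorthand."""
    ingestion = {
        "sub_second": "Kinesis Data Streams + Lambda (or MSK)",
        "minutes":    "Kinesis Data Firehose -> S3 -> Athena",
        "batch":      "Glue ETL or EMR -> S3",
    }
    engine = {
        "ad_hoc_sql":    "Athena",
        "concurrent_bi": "Redshift",
        "spark_ml":      "EMR or Glue Spark",
        "search":        "OpenSearch Service",
    }
    return {"ingest": ingestion[latency], "query": engine[query_pattern]}
```

For example, "near-real-time ingestion with occasional analyst SQL" resolves to Firehose for ingestion and Athena for consumption, exactly the serverless pattern described earlier.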
AWS Glue Data Catalog is the CENTRAL metastore for the entire data lake — Athena, EMR, Redshift Spectrum, and Glue ETL all read from it. Any question asking how to make data 'discoverable' or 'queryable' across multiple services points to Glue Data Catalog + Glue Crawlers.
AWS Lake Formation sits ON TOP of S3 and Glue Data Catalog — it does NOT replace them. Lake Formation adds fine-grained access control (column-level security, row-level filters, cell-level security) that S3 bucket policies alone cannot provide. If a question mentions 'column-level security on data lake', the answer is Lake Formation.
Athena charges per query based on data scanned. To reduce cost AND improve performance: partition your S3 data (by date, region, etc.), use columnar formats (Parquet or ORC), and use compression (Snappy, GZIP). These three techniques appear together in cost-optimization exam questions.
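The scanned-bytes pricing model is easy to sketch. The $5/TB figure below is the commonly cited us-east-1 rate and 10 MB is Athena's documented per-query minimum, but verify current pricing before relying on exact numbers:

```python
def athena_query_cost_usd(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate Athena query cost from bytes scanned.
    Assumes the commonly cited $5/TB us-east-1 rate and the
    10 MB per-query minimum; check current pricing."""
    billable = max(bytes_scanned, 10 * 1024**2)  # 10 MB minimum per query
    return round(billable / 1024**4 * price_per_tb, 6)

# Partitioning + columnar formats + compression all act by shrinking
# bytes_scanned, which is why the three appear together on the exam:
full_scan = athena_query_cost_usd(1024**4)              # scans 1 TB of raw JSON
pruned    = athena_query_cost_usd(int(0.05 * 1024**4))  # ~95% less after pruning
```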
Kinesis Data Firehose is the ONLY Kinesis service that natively delivers data directly to S3, Redshift, OpenSearch, and Splunk without custom code. Kinesis Data Streams requires a consumer (Lambda, KCL app, Flink) to write to S3. Confusing these two is the #1 Kinesis trap.
Lake Formation = fine-grained security (column/row level) on top of S3 + Glue Catalog. It does NOT replace IAM — both must grant access. 'Column-level security on data lake' always means Lake Formation.
Glue Data Catalog is the universal metastore connecting Athena, EMR, Redshift Spectrum, and Glue ETL. Any question about making S3 data 'discoverable' or 'queryable' by multiple services = Glue Crawlers + Glue Data Catalog.
Athena cost optimization = Partition data in S3 + Use Parquet/ORC columnar format + Enable compression (Snappy/GZIP). These three together reduce data scanned and therefore cost. Always apply all three.
AWS Glue ETL is serverless Spark — you pay only for DPU-hours when the job runs. EMR gives you full cluster control (Spark, Hive, Presto, HBase) with persistent or transient clusters. Choose Glue for simplicity and intermittent jobs; choose EMR when you need custom libraries, specific Spark versions, or cost optimization at scale with Spot Instances.
Redshift Spectrum lets Redshift query S3 data directly using the Glue Data Catalog WITHOUT loading data into Redshift. This is the 'hot/cold' tiering pattern: recent/hot data in Redshift tables, historical/cold data in S3 queried via Spectrum. A question about 'querying petabytes of S3 data with SQL alongside Redshift tables' = Redshift Spectrum.
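The hot/cold pattern often reduces to a `UNION ALL` across a local table and an external (Spectrum) schema. A sketch with hypothetical schema and table names:

```python
# Hot/cold tiering query: recent data in a local Redshift table,
# history in S3 behind an external Spectrum schema backed by the
# Glue Data Catalog. All names are hypothetical.
hot_cold_query = """
SELECT order_date, SUM(amount) AS revenue
FROM public.orders_recent              -- hot: local Redshift storage
GROUP BY order_date
UNION ALL
SELECT order_date, SUM(amount) AS revenue
FROM spectrum.orders_history           -- cold: S3 via Redshift Spectrum
GROUP BY order_date;
"""
```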
For real-time data lake ingestion, the canonical AWS pattern is: Source → Kinesis Data Firehose → (optional Lambda transform) → S3 (Parquet via Firehose format conversion) → Glue Crawler → Athena. Firehose can convert JSON to Parquet/ORC in-flight using the Glue Data Catalog schema — no separate ETL job needed.
Apache Iceberg on S3 (via Athena Engine v3 or EMR) enables ACID transactions, time-travel queries, and schema evolution on data lake storage. If a question asks how to support 'upserts', 'CDC merges', or 'query data as of a specific timestamp' on S3 without using Redshift, the answer involves Iceberg (or Hudi).
AWS DMS (Database Migration Service) is the standard way to replicate relational database changes (CDC) into a data lake on S3. DMS → S3 (CDC files) → Glue ETL (merge into Iceberg/Hudi) is the canonical CDC-to-data-lake pattern for exam scenarios involving 'near-real-time database replication to S3'.
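The merge step of that pattern can be expressed directly in Athena engine v3 SQL against an Iceberg table. A sketch using hypothetical table names and a DMS-style `op` column (`I` insert, `U` update, `D` delete):

```python
# Athena (engine v3) MERGE from a CDC staging table into an Iceberg table.
# Table and column names are hypothetical; 'op' mimics DMS change records.
cdc_merge_sql = """
MERGE INTO lake.customers AS t
USING lake.cdc_staging AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at)
"""
```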
Amazon DataZone is the newest data governance service for data mesh scenarios — it provides a business data catalog, data portal, and data subscriptions across accounts. If an exam question mentions 'data marketplace', 'data products', or 'business users discovering data assets across accounts', DataZone is the answer (not just Lake Formation).
Common Mistake
Lake Formation replaces S3 bucket policies and IAM — once you use Lake Formation, you only manage permissions in Lake Formation.
Correct
Lake Formation adds a permissions layer ON TOP of IAM and S3. Both Lake Formation permissions AND IAM permissions must allow access — Lake Formation uses a 'grant' model that works in conjunction with IAM, not instead of it. A principal needs BOTH IAM permissions AND Lake Formation grants to access a table.
Exam questions frequently test whether you understand that Lake Formation does not eliminate IAM. If Lake Formation grants access but IAM denies it, access is denied. The most restrictive policy wins. This causes real-world access failures and is a common exam distractor.
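The evaluation logic is a plain boolean AND, which makes it easy to remember. A minimal sketch of the double-lock model:

```python
def can_access_table(iam_allows: bool, lf_grants: bool) -> bool:
    """Lake Formation's 'double lock' (illustrative): a principal reaches
    a governed table only when BOTH IAM and Lake Formation say yes."""
    return iam_allows and lf_grants
```

Real evaluation has more moving parts (explicit IAM denies, data-location permissions, cross-account grants), but the AND is the exam-relevant core: neither side alone is sufficient.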
Common Mistake
A data lake and a data warehouse are interchangeable — you can use Redshift as a data lake.
Correct
A data lake (S3-based) stores raw, unprocessed data in native formats with schema-on-read. A data warehouse (Redshift) stores structured, processed data with schema-on-write optimized for SQL analytics. They serve different purposes and are often used together (lakehouse pattern). Redshift is NOT a data lake — it's a consumption layer.
Exam scenarios that describe 'storing raw logs, clickstreams, and IoT data for future unknown analysis' point to S3 data lake, not Redshift. Candidates who conflate the two choose Redshift for raw storage, which is expensive and inflexible.
Common Mistake
Glue Crawlers automatically keep the Data Catalog up to date whenever data changes in S3.
Correct
Glue Crawlers only update the catalog when they run, on a schedule or manually; by default they are NOT triggered when new data lands in S3. If you need the catalog updated immediately after ingestion, trigger the crawler from a Lambda function or EventBridge rule, run it on a tight schedule, or use Athena partition projection to avoid crawlers entirely.
Candidates assume crawlers are event-driven. In reality, stale catalog entries cause Athena query failures on new partitions. The exam tests this with scenarios where 'new S3 data is not visible in Athena' — the fix is to run the crawler or use MSCK REPAIR TABLE / partition projection.
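Partition projection removes the crawler dependency entirely for predictable partition layouts. A hedged sketch of the Athena DDL, with hypothetical table, columns, and bucket:

```python
# Athena DDL enabling partition projection for a daily 'dt' partition,
# so new partitions are queryable without crawlers or MSCK REPAIR TABLE.
# Table, column, and bucket names are hypothetical.
projection_ddl = """
CREATE EXTERNAL TABLE app_logs (
  request_id string,
  status     int)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-datalake/logs/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2024-01-01,NOW',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-datalake/logs/dt=${dt}/'
);
"""
```

Athena computes the partition list from these properties at query time, so the Glue catalog never needs per-partition entries for this table.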
Common Mistake
Athena is suitable for any analytics use case because it's serverless and requires no infrastructure.
Correct
Athena is optimized for ad-hoc, interactive queries. It is NOT suitable for: (1) sub-second latency dashboards (use Redshift or ElastiCache), (2) OLTP workloads, (3) very high concurrency with predictable performance SLAs (use Redshift), or (4) streaming queries (use Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics). Serverless ≠ unlimited performance.
Candidates over-apply Athena because it's simple and serverless. Exam questions that mention 'thousands of concurrent users', 'millisecond response times', or 'BI tool with consistent performance' should point to Redshift, not Athena.
Common Mistake
You must use AWS Glue for all ETL in a data lake — it's the only AWS ETL service.
Correct
AWS offers multiple ETL options: Glue ETL (serverless Spark/Python), EMR (full Hadoop/Spark cluster), Lambda (lightweight transformations, <15 min), Kinesis Data Firehose (streaming ETL with Lambda), AWS Glue DataBrew (no-code visual data preparation), and Step Functions (orchestration). The right choice depends on data volume, latency, complexity, and team skills.
Choosing Glue when EMR is more appropriate (e.g., complex multi-step Spark jobs with custom libraries) or Lambda when Glue is needed (e.g., large dataset transforms) is a common exam mistake. Match the tool to the workload characteristics described in the scenario.
Common Mistake
Data stored in S3 for a data lake is automatically encrypted and secured.
Correct
S3 encryption at rest must be explicitly chosen (SSE-S3, SSE-KMS, or SSE-C). S3 buckets are private by default, and AWS now applies SSE-S3 by default to new buckets, but compliance scenarios usually demand SSE-KMS, which must be configured explicitly. More importantly, encryption alone does not provide fine-grained access control; Lake Formation or S3 bucket policies are still required for authorization.
Exam questions about compliance requirements (HIPAA, PCI-DSS) require BOTH encryption (SSE-KMS for key management audit trails) AND access control (Lake Formation). Candidates who answer 'enable S3 encryption' without addressing access control miss half the requirement.
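To make the encryption half concrete, here is a minimal sketch of a default-encryption rule in the shape boto3's `put_bucket_encryption` expects. The bucket name and KMS key ARN are hypothetical placeholders:

```python
# Default-encryption rule in the shape expected by
# s3.put_bucket_encryption(..., ServerSideEncryptionConfiguration=...).
# The KMS key ARN is a hypothetical placeholder.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
            },
            # S3 Bucket Keys cut KMS request volume (and cost) for
            # lakes with very high object counts.
            "BucketKeyEnabled": True,
        }
    ]
}

# Usage (requires AWS credentials; not run here):
# import boto3
# boto3.client("s3").put_bucket_encryption(
#     Bucket="my-raw-zone-bucket",
#     ServerSideEncryptionConfiguration=encryption_config,
# )
```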
The six data lake layers in order: Ingest → Store → Catalog → Process → Consume → Govern. The mnemonic ISCPG ('I Store Cats Pretty Gracefully') covers five of them; just remember Consume slots in before Govern. Every question maps to one of these layers.
For Athena cost optimization remember 'PCP': Partition + Columnar format (Parquet/ORC) + Compress. Three steps, one answer, every cost question.
Firehose = FIRE it and forget (delivers directly to destination). Streams = STREAM requires a consumer to process and forward. Firehose delivers, Streams needs a reader.
Lake Formation = 'GRANT on top of IAM' — both must say YES. Think of it as a double lock: IAM key + Lake Formation key. Missing either = access denied.
Medallion layers: Bronze = Raw (don't touch), Silver = Cleaned (trust but verify), Gold = Business-ready (serve to users). B-S-G = Bad → Scrubbed → Good.
Candidates assume Lake Formation REPLACES IAM permissions. In reality, both Lake Formation grants AND IAM policies must allow access — the most restrictive wins. A Lake Formation grant without a corresponding IAM allow still results in access denied, and vice versa.
CertAI Tutor · 2026-02-22