Choose the right analytics engine — from serverless SQL to petabyte warehouses to big data frameworks
Provisioned warehouse vs serverless SQL vs managed big data frameworks — pick the right tool or fail the exam
| Feature | Redshift Petabyte-scale managed data warehouse | Athena Serverless SQL directly on S3 | EMR Managed big data frameworks at scale |
|---|---|---|---|
| Core Purpose — Redshift = warehouse (data loaded in), Athena = query in place (data stays in S3), EMR = custom big data processing | Structured data warehouse for BI, reporting, and complex analytics on loaded data | Ad-hoc interactive SQL queries directly against data in S3 — no loading required | Run open-source big data frameworks (Spark, Hive, Hadoop, Presto, HBase) at petabyte scale |
| Infrastructure Model — Athena is ALWAYS serverless; Redshift has BOTH provisioned and serverless options; EMR has both EC2-based clusters and EMR Serverless | Provisioned cluster (RA3, DC2 node types) OR Redshift Serverless — you choose | Fully serverless — zero infrastructure to manage, ever | Managed cluster of EC2 instances (primary/master, core, and task nodes) OR EMR Serverless |
| Data Location — Athena requires NO data loading; if a question mentions 'query data already in S3 without moving it,' the answer is Athena | Data loaded INTO Redshift storage (columnar, compressed); Redshift Spectrum can query S3 externally | Data stays IN S3 — Athena queries it in place; no data movement required | Reads from S3, HDFS, DynamoDB, HBase, or local instance storage; writes back to S3 or HDFS |
| Query Language / Interface — Redshift Python UDFs are being deprecated: new ones cannot be created after Patch 198. This is a live doc fact and may appear on exams | PostgreSQL-compatible SQL (ANSI SQL); also supports Python UDFs (deprecated after Patch 198, end of support June 30, 2026) | Standard SQL (Presto/Trino engine) for SQL queries; Athena for Apache Spark (Python) for notebook-based analytics | Spark (Scala, Python, Java, R), Hive SQL, Pig, HBase, Presto, Flink, and more — multi-framework |
| Pricing Model — Athena's $5/TB-scanned model rewards partitioning and columnar formats (Parquet/ORC); Redshift is cost-effective for frequent, repeated queries; EMR with Spot Instances is cheapest for batch workloads | Provisioned: per node-hour (RA3/DC2); Serverless: per Redshift Processing Unit (RPU) per second; RA3 storage billed separately | SQL: $5 per TB of data scanned; Spark: per DPU-hour; NO charge when idle — pay only for queries run | EC2 instance costs + EMR premium per instance-hour; EMR Serverless billed per vCPU-hour and GB-hour; Spot Instances can cut costs 50–90% |
| Cost Optimization Techniques — converting data to Parquet before querying with Athena is a top exam scenario: it reduces cost AND improves performance simultaneously | Reserved Instances (1 or 3 years), pause/resume clusters, Serverless auto-scaling, compression encodings, materialized views | Partition pruning, columnar formats (Parquet/ORC cut scan costs dramatically), compression, workgroups with data-scan limits | Spot Instances for task/core nodes, auto-scaling, instance fleets, EMR Serverless for intermittent workloads, S3 as persistent storage |
| Performance Characteristics — for a BI dashboard with many concurrent users repeating the same queries, Redshift wins thanks to result caching and its MPP architecture | Fastest for repeated, complex queries on loaded data; columnar storage, MPP, result caching, materialized views; sub-second to seconds | Good for ad-hoc queries; performance depends on data format, partitioning, and S3 layout; seconds to minutes | Best for large-scale batch transformations; can be tuned for low latency with Spark Streaming; highly configurable |
| Concurrency — Redshift WLM's max of 50 concurrency slots (across user-defined manual queues) is a documented hard limit; Athena concurrency is region-specific and adjustable via Service Quotas | WLM (Workload Management) controls concurrency; max 50 query slots across all user-defined manual WLM queues; Concurrency Scaling adds burst capacity | Active DML query quota (SELECT, CTAS, INSERT INTO) varies by region; a separate active DDL quota also applies; excess queries may be queued | Concurrency limited by cluster size (number of YARN containers); scales horizontally by adding nodes |
| Redshift Spectrum / External Tables — Redshift Spectrum and Athena can share the SAME Glue Data Catalog; this is a critical integration pattern tested on SAA-C03 and SAP-C02 | Redshift Spectrum allows querying S3 data directly from Redshift SQL — bridges Redshift and Athena use cases | Natively queries S3; uses the AWS Glue Data Catalog as its metastore — the same catalog Redshift Spectrum can use | EMR can also use the AWS Glue Data Catalog; Hive Metastore or Glue catalog for table definitions |
| Supported Frameworks / Engines — EMR is the ONLY one of the three that supports non-SQL big data frameworks such as Hadoop MapReduce, Pig, and HBase; if the question mentions these, the answer is EMR | PostgreSQL-compatible SQL engine only (plus Redshift ML via SageMaker integration) | Presto/Trino (SQL queries) + Apache Spark (Python notebooks) — two distinct engines | Spark, Hadoop MapReduce, Hive, Pig, HBase, Presto, Flink, Livy, Zeppelin, JupyterHub, TensorFlow, and more |
| Machine Learning Integration — Redshift ML lets data analysts run ML predictions in pure SQL, no Python required; a differentiator tested on DEA-C01 | Redshift ML — create, train, and invoke SageMaker models using SQL CREATE MODEL statements | No native ML training; can query ML inference results stored in S3 | Native support for Spark MLlib, TensorFlow on Spark, and direct SageMaker integration via EMR notebooks |
| Startup / Provisioning Time — for 'run a query right now with no setup,' Athena is always the answer; EMR cluster startup time disqualifies it for interactive/ad-hoc scenarios | Provisioned cluster: minutes to provision; Serverless: near-instant (cold start in seconds); pause/resume: minutes | Instant — no provisioning ever; queries start in seconds | EC2 cluster bootstrap: 5–15+ minutes; EMR Serverless: faster but still warms up; not suitable for sub-minute SLAs |
| Data Formats Supported — Athena has native support for AWS log formats (CloudTrail, VPC Flow Logs, ALB logs), a common exam scenario for security analysis without ETL | CSV, JSON, Avro, Parquet, ORC, fixed-width, delimited (for the COPY command); columnar internally | CSV, TSV, JSON, Parquet, ORC, Avro, Ion, Regex SerDe, Grok, CloudTrail logs, ALB/ELB logs, VPC Flow Logs, and more | Any format the framework supports — Parquet, ORC, Avro, JSON, CSV, SequenceFiles, HBase, Delta Lake, Apache Iceberg, Apache Hudi |
| Table Formats (Data Lake) — for building and maintaining open table formats (Iceberg/Hudi/Delta), EMR is the most capable; Athena can query them, but EMR is preferred for writing | Supports Apache Iceberg via Redshift Spectrum and managed storage | Supports Apache Iceberg natively, plus querying Apache Hudi and Delta Lake tables; integrates with Lake Formation | Full native support for Apache Iceberg, Apache Hudi, and Delta Lake — best for writing/managing these formats |
| Security & Access Control — Lake Formation integrates with all three for fine-grained data access control, a key pattern for data mesh and multi-team data lake architectures | VPC, IAM, database users/groups, row-level and column-level security, KMS encryption, SSL, AWS Lake Formation integration | IAM policies, workgroup-level controls, S3 bucket policies, KMS encryption, Lake Formation column/row filtering, VPC endpoints | IAM roles (EC2 instance profile + EMR role), Kerberos, Apache Ranger, Lake Formation, VPC, security groups, KMS, TLS in transit |
| Backup & Durability | Automated snapshots (1-day default retention, up to 35 days), manual snapshots, cross-region snapshot copy; data replicated within the cluster | No backup needed — data lives in S3 (11 nines durability); query results and history stored in S3 | Transient clusters write results to S3; HDFS data is lost when the cluster terminates — use S3 as the persistent layer |
| Use Case Fit | BI dashboards, complex multi-join queries, regular reporting, on-prem data warehouse migration, structured data analytics | Ad-hoc log analysis, one-time queries, querying raw S3 data, serverless ETL (CTAS), security investigations, cost-sensitive infrequent queries | ETL pipelines, ML feature engineering, graph processing, real-time streaming (Spark Streaming/Flink), custom big data applications |
| Managed vs. Self-Managed — EMR requires the most operational expertise; if a question emphasizes 'minimal operational overhead' or 'no cluster management,' eliminate EMR | Fully managed (patching, backups, monitoring automated); you manage schema, WLM, and query optimization | Fully managed and serverless — zero operational overhead; AWS handles everything | AWS manages cluster provisioning and framework installation; YOU manage cluster sizing, configuration tuning, and application code |
| Glue Data Catalog Integration — all three services can share the SAME Glue Data Catalog, enabling a unified metadata layer across the entire data lake; a critical architecture pattern | Uses the Glue Data Catalog via Redshift Spectrum for external tables | Uses the Glue Data Catalog as its primary metastore — tightly integrated by default | Can use the Glue Data Catalog as a drop-in replacement for the Hive Metastore |
| Streaming / Real-Time — Redshift Streaming Ingestion is a newer feature for near-real-time warehousing; for true streaming analytics, EMR with Flink/Spark Streaming is the answer | Redshift Streaming Ingestion (from Kinesis Data Streams and MSK/Kafka) — near-real-time ingestion directly into Redshift | Not designed for real-time; queries data at rest in S3; Kinesis connectors exist, but streaming is not native | Spark Streaming and Apache Flink on EMR — full real-time and micro-batch streaming support |
| Exam Frequency | Very high — appears in nearly every data/analytics scenario on SAA-C03, SAP-C02, DEA-C01 | Very high — the #1 most tested analytics service on CLF-C02, SAA-C03, DEA-C01 | High — tested for ETL, Hadoop migration, and big data framework scenarios on SAA-C03, DEA-C01, SAP-C02 |
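The $5/TB pricing row above lends itself to a quick back-of-the-envelope calculator. A minimal Python sketch, assuming the common $5/TB rate and Athena's 10 MB per-query minimum scan charge (rates vary by region):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate Athena SQL cost for one query from bytes scanned.

    Assumes $5/TB (region-dependent) and the 10 MB per-query
    minimum scan charge.
    """
    MIN_BYTES = 10 * 1024 ** 2  # 10 MB minimum billed per query
    TB = 1024 ** 4
    return max(bytes_scanned, MIN_BYTES) / TB * price_per_tb

# A full 2 TB CSV scan vs. scanning ~10% of that after converting
# to Parquet and pruning partitions (hypothetical numbers).
full_scan = athena_query_cost(2 * 1024 ** 4)
pruned = athena_query_cost(int(0.2 * 1024 ** 4))
```

The same arithmetic explains why workgroup scan limits are an effective cost control: they cap `bytes_scanned` directly.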
Summary
Use Redshift when you have structured data that needs frequent, complex SQL queries from BI tools and dashboards — load the data in and get fast repeated performance. Use Athena when your data already lives in S3 and you need serverless, pay-per-query SQL without any infrastructure — ideal for ad-hoc analysis and log investigations. Use EMR when you need open-source big data frameworks (Spark, Hive, Hadoop), custom processing pipelines, ML feature engineering, or streaming — and you're comfortable managing cluster configuration.
🎯 Decision Tree
- IF data is already in S3 AND queries are infrequent/ad-hoc → Athena
- IF structured data needs frequent BI/dashboard queries with many concurrent users → Redshift
- IF you need Spark/Hadoop/Hive/custom big data frameworks OR ETL pipelines → EMR
- IF the question mentions 'no infrastructure management' or 'serverless' → Athena (or Redshift Serverless for a warehouse)
- IF the question mentions 'migrate an on-prem data warehouse' → Redshift
- IF the question mentions 'query S3 logs (VPC Flow, CloudTrail, ALB)' → Athena
- IF the question mentions 'Hadoop MapReduce', 'Pig', or 'HBase' → EMR only
- IF the question mentions 'petabyte data warehouse with SQL' → Redshift
- IF cost optimization for infrequent large scans → Athena with Parquet + partitioning
DECISION RULE: Data in S3 + ad-hoc/infrequent queries = Athena. Structured data + frequent BI/dashboards + concurrent users = Redshift. Custom big data frameworks (Spark/Hadoop/Hive/HBase) + ETL pipelines = EMR. Memorize this triangle.
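The triangle above can be encoded as a small function for self-testing. This is a toy sketch for drilling the rule, not an official decision procedure; the flags and labels are invented for illustration:

```python
def pick_engine(needs_big_data_frameworks: bool,
                data_already_in_s3: bool,
                query_pattern: str) -> str:
    """Toy encoding of the decision triangle:
    frameworks -> EMR, frequent BI -> Redshift,
    ad-hoc SQL on data already in S3 -> Athena."""
    if needs_big_data_frameworks:       # Spark/Hadoop/Hive/HBase, custom ETL
        return "EMR"
    if query_pattern == "frequent-bi":  # dashboards, many concurrent users
        return "Redshift"
    if data_already_in_s3:              # ad-hoc / infrequent SQL in place
        return "Athena"
    return "Redshift"                   # load-and-query warehouse default
```

Note the precedence: a framework requirement trumps everything, mirroring the 'Hadoop MapReduce/Pig/HBase → EMR only' rule.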
Athena cost optimization: Converting to Parquet + adding partitions can reduce query cost by 60-90%. This single optimization tip appears in multiple exam scenarios across CLF-C02, SAA-C03, and DEA-C01.
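The mechanics behind that saving can be pictured with Hive-style partition prefixes: when a table is partitioned, the engine only reads objects whose key prefix matches the WHERE clause. A toy illustration (the keys and layout are hypothetical):

```python
# Hypothetical S3 keys in a Hive-style partitioned, Parquet-backed layout.
keys = [
    "logs/year=2025/month=01/part-0.parquet",
    "logs/year=2025/month=02/part-0.parquet",
    "logs/year=2026/month=01/part-0.parquet",
]

def prune(keys: list[str], year: int) -> list[str]:
    """Keep only objects under the requested partition prefix --
    a toy model of how 'WHERE year = ...' skips whole S3 objects."""
    prefix = f"logs/year={year}/"
    return [k for k in keys if k.startswith(prefix)]

matched = prune(keys, 2026)  # only one of three objects is scanned
```

Columnar formats compound this: within each surviving Parquet object, only the referenced columns are read.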
Athena Active DML quota counts RUNNING + QUEUED queries. Redshift WLM max 50 slots (all user-defined manual queues). EMR HDFS is ephemeral. These three limits are the most exam-tested constraints across all three services.
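The RUNNING + QUEUED rule is easier to remember with a toy model: submissions beyond the running capacity queue up, and once running plus queued hits the quota, further submissions are rejected. The limit values below are invented for illustration; real quotas are region-specific:

```python
from collections import deque

class ActiveQueryQuota:
    """Toy model of Athena's active-DML quota: RUNNING and QUEUED
    queries both count toward the limit (limits here are hypothetical)."""

    def __init__(self, limit: int, max_running: int):
        self.limit = limit            # quota on RUNNING + QUEUED
        self.max_running = max_running
        self.running: set[str] = set()
        self.queued: deque[str] = deque()

    def submit(self, qid: str) -> str:
        if len(self.running) + len(self.queued) >= self.limit:
            raise RuntimeError("TooManyRequestsException (toy)")
        if len(self.running) < self.max_running:
            self.running.add(qid)
            return "RUNNING"
        self.queued.append(qid)
        return "QUEUED"
```

The exam point is the counting rule, not the numbers: a queued query still consumes quota.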
All three services share the AWS Glue Data Catalog — Athena (primary metastore), Redshift Spectrum (external tables), EMR (Hive Metastore replacement). This unified metadata layer is a core data lake architecture pattern.
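A sketch of what 'one catalog, two engines' looks like in SQL (all names, the database, and the IAM role ARN are hypothetical): Athena queries the Glue database directly, while Redshift mounts the same Glue database once as an external schema and then queries the identical table definition.

```python
GLUE_DATABASE = "sales_lake"  # hypothetical Glue Data Catalog database

# Athena: Glue is the default metastore, so the table is queried directly.
athena_sql = (
    f'SELECT region, SUM(amount) FROM "{GLUE_DATABASE}".orders GROUP BY region'
)

# Redshift Spectrum: mount the same Glue database as an external schema once...
redshift_ddl = f"""
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE '{GLUE_DATABASE}'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
"""

# ...then query the same table definition from Redshift SQL.
redshift_sql = "SELECT region, SUM(amount) FROM spectrum_sales.orders GROUP BY region"
```

Because both engines resolve `orders` through the same catalog entry, a schema change registered once in Glue is visible to both.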
Redshift Python UDFs: no new ones can be created after Patch 198, and existing ones are deprecated as of June 30, 2026. EMR Task nodes are safe on Spot (they hold no HDFS data). Redshift Serverless incurs no compute charge when idle. These are the 'gotcha' facts that differentiate passing from failing.
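The Spot-on-task-nodes advice follows directly from the pricing model in the table: the EMR premium is charged per instance-hour on top of EC2, and the Spot discount applies to the EC2 portion. A rough cost model (all rates and discounts below are hypothetical):

```python
def emr_cluster_hourly_cost(on_demand_nodes: int, spot_nodes: int,
                            ec2_rate: float, emr_premium: float,
                            spot_discount: float = 0.70) -> float:
    """Rough hourly cost of an EMR-on-EC2 cluster (illustrative rates).

    The EMR premium is billed per instance-hour on top of EC2; the
    Spot discount applies only to the EC2 portion of each node.
    """
    on_demand = on_demand_nodes * (ec2_rate + emr_premium)
    spot = spot_nodes * (ec2_rate * (1 - spot_discount) + emr_premium)
    return on_demand + spot

# Moving 8 of 10 task nodes to Spot at a 70% discount (hypothetical rates):
all_on_demand = emr_cluster_hourly_cost(10, 0, ec2_rate=1.0, emr_premium=0.25)
mostly_spot = emr_cluster_hourly_cost(2, 8, ec2_rate=1.0, emr_premium=0.25)
```

Task nodes are the safe place for this saving because reclaiming one loses compute capacity but no HDFS blocks.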
The #1 exam trap: Choosing EMR when Athena is the correct answer. Candidates see 'large dataset' or 'complex query' and jump to EMR, but if the data is already in S3 and the requirement is SQL queries (even complex ones), Athena is correct — it's serverless, requires no cluster, and scales automatically. EMR is only correct when you need non-SQL frameworks (Spark, Hadoop, Hive), custom processing logic, or existing Hadoop workload migration.
CertAI Tutor · DEA-C01, CLF-C02, DOP-C02, SAA-C03, SAP-C02, SCS-C02, DVA-C02 · 2026-02-22