analyticsCLF-C02DEA-C01DOP-C02SAA-C03SAP-C02+1 more

Amazon Athena: Query S3 Like a Database — No Servers Required

Serverless, interactive SQL analytics directly on S3 data — pay only for bytes scanned

Updated 2026-02-22

Overview

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL — no infrastructure to provision, manage, or scale. It uses Presto/Trino under the hood, integrates natively with AWS Glue Data Catalog for schema management, and charges only for the data scanned per query. Athena is purpose-built for ad-hoc exploration, log analysis, and federated queries across heterogeneous data sources without moving data.

Enable fast, cost-effective, ad-hoc SQL querying of data at rest in Amazon S3 and federated sources without provisioning or managing any compute infrastructure.

Use When

Ad-hoc analysis of log files (CloudTrail, ELB, VPC Flow Logs, CloudFront) stored in S3
One-time or infrequent queries on large datasets where provisioning a data warehouse is cost-prohibitive
Building serverless ETL pipelines using CTAS (CREATE TABLE AS SELECT) to transform and partition S3 data
Federated queries across S3, RDS, DynamoDB, and on-premises sources via Athena Data Source Connectors
Querying AWS Glue Data Catalog tables from BI tools like Amazon QuickSight without a persistent cluster

Avoid When

High-concurrency, sub-second transactional workloads — Athena is not an OLTP engine; use Amazon RDS or Aurora instead
Repeated, complex analytics on structured relational data with heavy joins — Amazon Redshift provides better performance and cost efficiency at scale
Real-time streaming analytics — use Amazon Kinesis Data Analytics or Apache Flink; Athena queries are batch-oriented
Workloads requiring row-level updates or deletes on traditional tables — Athena is primarily read-optimized (though Iceberg tables add limited DML support)

Key Features

Serverless — no cluster to manage

Compute scales automatically; you never provision, patch, or resize nodes

Standard ANSI SQL (Presto/Trino engine)

Supports DDL, DML, window functions, CTEs, UNNEST, and geospatial functions

AWS Glue Data Catalog integration

Default metastore; tables defined in Glue are immediately queryable in Athena

Federated Query (Data Source Connectors)

Query RDS, Aurora, DynamoDB, Redshift, CloudWatch Logs, and on-premises sources via Lambda-based connectors

CTAS (CREATE TABLE AS SELECT)

Write query results back to S3 as Parquet/ORC — a core serverless ETL pattern

INSERT INTO (append to existing table)

Appends results to existing Athena tables; does not support UPDATE or DELETE on non-Iceberg tables

Apache Iceberg table support

Full ACID transactions, time travel, row-level deletes/updates on Iceberg tables — major differentiator from plain Hive tables

Apache Hudi table support

Read Hudi tables; limited write support compared to Iceberg

Apache Delta Lake table support

Read Delta Lake tables via manifest files or native support depending on engine version

Partition Projection

Computes partition values from query predicates instead of querying Glue catalog — eliminates partition listing overhead for high-cardinality time-series data

Workgroups

Isolate teams, enforce result locations, set per-query scan limits, control encryption, and track costs separately

Query result encryption

Supports SSE-S3, SSE-KMS, and CSE-KMS for results stored in S3

Prepared Statements

Parameterized queries that prevent SQL injection — use EXECUTE statement to run them

Query result reuse

Athena can reuse cached results for identical queries within a configurable TTL — reduces cost and latency

Athena for Apache Spark

Run serverless Spark notebooks and applications — separate pricing from SQL engine; uses DPU-hours

JDBC/ODBC drivers

Connect BI tools (Tableau, Power BI, QuickSight) directly to Athena via standard drivers

Cross-account S3 access

Requires S3 bucket policies and IAM role trust policies to be configured correctly

CloudTrail integration for audit

All Athena API calls are logged to CloudTrail — query history is also available in Athena console for 45 days

VPC endpoint (PrivateLink) support

Access Athena from within a VPC without traversing the public internet

Resource-based tagging

Tag workgroups for cost allocation; use tag-based IAM conditions to restrict workgroup access

Integration Patterns

S3 Data Lake Query Engine

high freq

Amazon AthenaAmazon S3

Athena's primary integration — query raw, processed, or curated data in any S3 bucket using SQL. Use columnar formats (Parquet/ORC) and Hive-style partitioning to minimize data scanned and reduce cost. Athena never moves data; it reads directly from S3.

Serverless ETL + Schema Management

high freq

Amazon AthenaAWS Glue

AWS Glue Crawlers automatically discover S3 data and populate the Glue Data Catalog, which Athena uses as its default metastore. Glue ETL jobs transform raw data into Parquet/ORC, then Athena queries the optimized output. This is the canonical serverless analytics pipeline on AWS.

Serverless BI Dashboard

high freq

Amazon AthenaAmazon QuickSight

QuickSight connects to Athena via JDBC to visualize S3 data lake content without a persistent database. Athena acts as the query layer; QuickSight renders dashboards. SPICE caching in QuickSight reduces repeated Athena query costs.

CloudTrail Log Analysis

high freq

Amazon AthenaAWS CloudTrail

Athena has a native CloudTrail integration — the console can auto-generate the CREATE TABLE DDL for CloudTrail logs stored in S3. Query API activity, detect unauthorized access, and audit resource changes using SQL without loading logs into a database.

Federated Query Across Warehouse + Lake

high freq

Amazon AthenaAmazon Redshift

Athena Federated Query can join S3 data lake tables with Amazon Redshift tables in a single SQL statement via the Redshift connector. Alternatively, use Redshift Spectrum (Redshift's own external table feature) when Redshift is the primary query engine. Know which service owns the query.

End-to-End Serverless Analytics Pipeline

high freq

Amazon AthenaAmazon S3AWS GlueAmazon QuickSight

The canonical AWS serverless analytics stack: raw data lands in S3 → Glue Crawlers catalog it → Glue ETL converts to Parquet with partitioning → Athena queries optimized data → QuickSight visualizes results. Zero persistent compute, pay-per-use at every layer.

Federated Log Analytics

medium freq

Amazon AthenaAmazon CloudWatch Logs

Using the CloudWatch Logs Athena connector (Lambda-based), query CloudWatch Log Groups directly from Athena SQL alongside S3 data. Useful for correlating application logs with S3-based event data in a single query.

Shared Metastore with EMR

medium freq

Amazon AthenaAmazon EMR

Athena and EMR can share the same AWS Glue Data Catalog, allowing Spark/Hive jobs on EMR to write data that Athena immediately queries via SQL — no schema synchronization needed. Ideal for hybrid batch + ad-hoc analytics architectures.

Fine-Grained Data Lake Access Control

medium freq

Amazon AthenaAWS Lake Formation

Lake Formation governs column-level, row-level, and cell-level security on Glue Data Catalog tables. Athena respects Lake Formation permissions — users only see data they are authorized to access, even when querying the same underlying S3 files.

Event-Driven Query Trigger

medium freq

Amazon AthenaAWS Lambda

Lambda functions can invoke Athena queries programmatically via the StartQueryExecution API, then poll for results using GetQueryExecution. Common pattern: S3 event triggers Lambda → Lambda starts Athena query → results written back to S3 for downstream processing.

Service Limits & Quotas

LimitValueNote

DDL query timeout

600 seconds

DML query timeout is separate and configurable per workgroup

DML query timeout (default)

30 minutes

This is a per-workgroup setting, not a global hard limit — candidates confuse it with a fixed ceiling

Active DML queries per account (default)

200 concurrent queries

This is a soft limit that can be raised via Service Quotas console — do not treat it as a hard ceiling

Active DDL queries per account (default)

20 concurrent queries

DDL concurrency is far lower than DML — avoid parallel schema creation scripts in automation pipelines

Query string length

262144 bytes (256 KB)

Maximum SQL query string is 256 KB — very large dynamic SQL must be broken into steps or use stored procedures in Redshift instead

Databases per catalog

No documented hard limit

Use qualitative understanding: Glue Data Catalog is the default metastore; external Hive metastores are also supported

Tables per database

No documented hard limit

Glue catalog limits apply upstream — know that Glue has a limit of 1 million tables per account

Partitions per table (Glue-backed)

No documented Athena-specific hard limit

Glue itself has a limit of 10 million partitions per table — Athena inherits this indirectly

Query result retention in S3

Configurable

Query results are stored in a designated S3 bucket; apply S3 lifecycle policies to manage cost — Athena does not auto-delete results

Workgroups per account

No documented hard limit

Workgroups are the primary governance mechanism — use them to enforce per-query data scan limits, result locations, and encryption settings

Prepared statements per workgroup

1000 statements

Prepared statements improve parameterized query security (prevent SQL injection) and are scoped per workgroup

Named queries per workgroup

1000 named queries

Named queries are saved SQL templates — useful for standardizing team queries but limited to 1000 per workgroup

Data scanned pricing threshold for cancellation

Configurable via workgroup bytes

Workgroups can enforce a per-query data scan limit that cancels runaway queries before they incur excessive cost — a critical cost-control mechanism

Pricing Model

Pay-per-query (data scanned) for SQL engine; DPU-hours for Spark engine

SQL queries: charged per TB of data scanned from S3 — currently $5.00 per TB scanned (verify current rate in AWS pricing page)
Minimum charge per query: 10 MB of data scanned — even tiny queries are billed at the 10 MB floor
Queries against data in columnar formats (Parquet, ORC) with partitioning can reduce scanned data by 30–90%, dramatically cutting costs
DDL statements (CREATE TABLE, DROP TABLE, SHOW) and failed queries are NOT charged
Athena for Apache Spark is priced per DPU-hour consumed during notebook/application execution — separate from SQL pricing
No charge for Athena query results stored in S3 — but you pay standard S3 storage and request rates for the results bucket
Federated queries incur Lambda invocation costs in addition to Athena scan costs
Query result reuse: if results are served from cache, you are NOT charged for data scanned — significant cost optimization
Cross-Region data transfer costs apply if S3 data and Athena queries are in different Regions — always use same-Region S3 buckets

Exam Tips

criticalCost optimization, columnar formats, partitioning

Athena charges for data SCANNED, not data returned. Always partition data by query predicates (date, region, account) and convert to columnar formats (Parquet, ORC) to reduce scan volume and cost — this is the #1 Athena optimization pattern tested on exams.

criticalService selection, Redshift vs Athena

Athena is NOT a replacement for Amazon Redshift. Athena = serverless, ad-hoc, pay-per-scan for S3 data. Redshift = persistent cluster, complex analytical workloads, high concurrency, columnar storage engine. Exam questions will present both as options — choose Athena for infrequent/ad-hoc, choose Redshift for frequent/complex/high-concurrency.

criticalWorkgroups, cost control, governance

Workgroups are the primary governance and cost control mechanism. They enforce: (1) per-query data scan limits that auto-cancel expensive queries, (2) result location in S3, (3) encryption settings, (4) separate cost tracking via tags. If an exam question asks how to prevent runaway Athena costs or isolate team query costs — the answer is Workgroups.

criticalGlue Data Catalog, metastore, schema discovery

AWS Glue Data Catalog is Athena's default metastore — you do NOT need to create a separate database or schema management system. Glue Crawlers auto-discover S3 data and create table definitions that Athena immediately uses. This integration is central to nearly every Athena architecture question.

criticalCloudTrail, security audit, log analysis

For CloudTrail log analysis at scale, Athena is the AWS-recommended approach. The Athena console can auto-generate the CREATE TABLE statement for CloudTrail logs. This pattern (S3 + Athena for CloudTrail) appears frequently in security and audit scenarios on SAA-C03, SAP-C02, and DEA-C01.

critical

Athena charges per TB scanned — always use Parquet/ORC columnar formats + partition pruning to reduce cost. Workgroups enforce per-query scan limits to prevent runaway costs. These three concepts (columnar format, partitioning, workgroups) are the core of every Athena optimization question.

critical

Athena ≠ Redshift: Athena is serverless, ad-hoc, pay-per-scan for S3. Redshift is a persistent cluster for high-concurrency, complex analytics. When an exam question asks you to choose between them, the deciding factors are: query frequency (infrequent = Athena), infrastructure management preference (none = Athena), and concurrency requirements (high = Redshift).

critical

AWS Glue Data Catalog is Athena's default metastore — Glue Crawlers auto-populate table schemas from S3 data. Athena, EMR, and Redshift Spectrum all share this catalog. Lake Formation adds fine-grained access control on top. This integration chain (S3 → Glue → Athena → QuickSight) is the canonical serverless analytics architecture on AWS.

importantPartition projection, performance optimization, high-cardinality partitions

Partition Projection eliminates the need to call the Glue catalog for partition metadata — Athena computes valid partition values from the query predicate directly. This is critical for time-series data with thousands or millions of partitions (e.g., hourly logs). If an exam question mentions slow query planning due to too many partitions, Partition Projection is the answer.

importantFederated query, Lambda connectors, multi-source joins

Athena Federated Query uses Lambda-based connectors to query data sources outside S3 (RDS, DynamoDB, Redshift, on-premises via JDBC). The Lambda connector runs in your account and VPC. If a question asks how to join S3 data with RDS data in a single SQL query without ETL — Athena Federated Query is the answer.

importantIceberg, ACID, row-level DML, open table formats

Apache Iceberg tables in Athena support full ACID transactions including UPDATE, DELETE, and MERGE — something plain Hive-style Athena tables do NOT support. If an exam scenario requires updating or deleting specific rows in a data lake, Iceberg (not standard Athena tables) is the correct answer.

importantQuery result reuse, caching, cost optimization

Query result reuse (caching) in Athena means identical queries within the configured TTL return cached results at zero scan cost. This is a legitimate cost optimization for dashboards with repeated queries — distinguish it from SPICE caching in QuickSight (client-side) vs Athena result reuse (engine-side).

Good to KnowPricing, failed queries, DDL statements

Failed queries in Athena are NOT charged — you only pay for successful queries that complete data scanning. DDL statements (CREATE TABLE, SHOW, DROP) are also free. This is a pricing nuance tested in CLF-C02 and SAA-C03.

Good to KnowAthena for Spark, DPU pricing, serverless Spark

Athena for Apache Spark is a SEPARATE capability from Athena SQL — it runs serverless Spark applications and notebooks, priced in DPU-hours (not per TB scanned). Do not confuse the two pricing models. If a question mentions Spark notebooks on Athena, the pricing answer is DPU-hours, not TB scanned.

Common Misconceptions & Traps

Common Mistake

Athena can provide ML-based recommendations and predictions on S3 data

Correct

Athena is a SQL query engine — it has no built-in ML inference capability. For ML on S3 data, use Amazon SageMaker. Athena can call SageMaker endpoints via the ML with Athena feature (USING FUNCTION with Lambda), but Athena itself does not perform ML.

Exam questions sometimes present Athena as an all-in-one analytics + ML service. Remember: Athena = SQL only. Any ML capability requires an external service. This misconception is especially common in CLF-C02 and DEA-C01 questions that conflate analytics and AI/ML services.

Common Mistake

AWS Glue Schema Registry controls who can access Athena query results and manages Athena permissions

Correct

AWS Glue Schema Registry manages schema evolution for streaming data (Kafka, Kinesis) — it has nothing to do with Athena query result access control. Athena result access is controlled by S3 bucket policies, IAM policies, workgroup settings, and AWS Lake Formation permissions.

This is a real exam trap: both Glue Schema Registry and Athena involve Glue, so candidates conflate them. Schema Registry = streaming schema versioning. Athena access control = IAM + S3 + Lake Formation + Workgroups. Keep them completely separate in your mental model.

Common Mistake

AWS Glue DataBrew can replace Athena as a SQL query engine for ad-hoc data analysis

Correct

AWS Glue DataBrew is a visual, no-code data preparation tool for cleaning and normalizing data — it is NOT a SQL query engine and cannot serve as an Athena replacement. DataBrew outputs transformed datasets; Athena queries data in place using SQL.

Both services involve Glue and data transformation, causing confusion. Remember: DataBrew = visual data prep (think: spreadsheet-like UI for data cleaning). Athena = SQL query engine on S3. They can complement each other (DataBrew cleans → Athena queries) but are not interchangeable.

Common Mistake

Athena stores and manages data like a traditional database — you load data into Athena

Correct

Athena is schema-on-read: data lives in S3, and Athena applies a schema at query time. You never 'load' data into Athena. Athena only stores query results (in a designated S3 bucket) and metadata (in Glue Data Catalog). The source data remains in S3, untouched.

Candidates from traditional database backgrounds assume a load step is required. This leads to wrong architecture decisions. The correct mental model: Athena = a query engine that reads S3 directly. No ETL load step into Athena is ever needed or possible.

Common Mistake

Athena and Amazon Redshift Spectrum are the same thing and can be used interchangeably

Correct

Athena is a standalone serverless SQL service that queries S3 independently. Redshift Spectrum is a feature of Amazon Redshift that allows a Redshift cluster to query external S3 data — it requires an active Redshift cluster. Athena requires no cluster at all. They share the Glue Data Catalog but are architecturally distinct.

Both query S3 using SQL and use the Glue catalog, making them appear identical. The key differentiator: Spectrum extends Redshift (cluster required, unified query across Redshift + S3), while Athena is completely standalone. Choose Athena when you want zero cluster management; choose Spectrum when Redshift is already your primary warehouse.

Common Mistake

Partitioning data in S3 automatically makes Athena queries faster without any additional configuration

Correct

Partitioning helps only if queries include predicates on the partition columns AND Athena knows about the partitions. You must either run MSCK REPAIR TABLE to load partitions into the Glue catalog, add partitions manually with ALTER TABLE ADD PARTITION, use Glue Crawlers, or enable Partition Projection. Without this, Athena scans all partitions regardless.

Many candidates partition their S3 data correctly but forget to register partitions with the metastore. The result: full table scans despite partitioning. This is a common architecture mistake tested in DEA-C01 and SAA-C03 performance optimization questions.

Memory Tricks

🧠

ATHENA = Ask The Hive Engine, No Administration — serverless SQL on S3, schema-on-read, no cluster needed

🧠

Cost control trio: Partition + Columnar + Workgroup = PCW (Pay Carefully, Win) — the three levers that minimize Athena spend

🧠

Glue = the GLUE between S3 data and Athena queries — Crawlers discover, Catalog stores schemas, Athena reads

🧠

Iceberg = ICE (Insert, Change, Erase) — the open table format that adds ACID DML to Athena's otherwise read-heavy model

🧠

Federated Query = Lambda as the BRIDGE — every non-S3 data source connector is a Lambda function in your account

CertAI Tutor · CLF-C02, DEA-C01, DOP-C02, SAA-C03, SAP-C02, DVA-C02 · 2026-02-22

Ready to test your knowledge?

Practice CLF-C02, DEA-C01, DOP-C02, SAA-C03, SAP-C02, DVA-C02 exam questions with AI-powered explanations — free to start.

Amazon Athena: Query S3 Like a Database — No Servers Required

Overview

Key Features

Integration Patterns

Service Limits & Quotas

Pricing Model

Exam Tips

Common Misconceptions & Traps

Memory Tricks

Ready to test your knowledge?

Related Cheat Sheets