
Fully managed, serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics and ML
AWS Glue is a fully serverless data integration service that provides a unified platform for discovering, cataloging, cleaning, transforming, and moving data across data stores. It eliminates infrastructure management by automatically provisioning, configuring, and scaling the resources needed to run ETL jobs. Glue supports batch ETL workloads natively — it is NOT a real-time streaming service — and integrates deeply with the broader AWS analytics ecosystem including S3, Redshift, Athena, and EMR.
Automate and simplify the Extract, Transform, and Load (ETL) process for batch data pipelines without managing servers, making data ready for analytics and ML workloads
Glue Data Catalog
Centralized, persistent metadata repository. Acts as the Hive Metastore for Athena, EMR, and Redshift Spectrum. One catalog per account per region.
Glue Crawlers
Automatically scan data stores, infer schemas, and populate the Data Catalog. Support S3, JDBC, DynamoDB, DocumentDB, MongoDB, and more.
Glue ETL Jobs (Apache Spark)
Serverless Spark jobs. Auto-generates PySpark or Scala code. Supports G.1X, G.2X, G.4X, and G.8X worker types, plus the quarter-DPU G.025X type for low-volume streaming jobs.
Glue Python Shell Jobs
Lightweight Python scripts without Spark. Ideal for small datasets, API calls, or orchestration logic. Uses 0.0625 or 1 DPU.
Glue Streaming ETL
Continuous ETL from Kinesis Data Streams or Apache Kafka (MSK). Built on Spark Structured Streaming. NOT for sub-second latency use cases.
Glue DataBrew
Visual, no-code data preparation tool with 250+ built-in transformations. Separate billing from Glue ETL.
Glue Data Quality
Define data quality rules using DQDL (Data Quality Definition Language). Evaluate rules during ETL jobs or as standalone runs.
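As a sketch, a DQDL ruleset for a hypothetical orders table might look like the following (table and column names are illustrative, not from the source):

```
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "price" > 0
]
```

Each rule evaluates to pass/fail; a ruleset run can gate an ETL job or publish results as a standalone quality report.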
Glue Workflows
Orchestrate complex ETL pipelines with multiple jobs and crawlers. Triggered by schedules, events, or on-demand. Alternative: AWS Step Functions for more complex orchestration.
Glue Triggers
Schedule-based, on-demand, or conditional (job completion). Used within Glue Workflows to chain jobs.
Glue Studio
Visual drag-and-drop ETL job builder. Generates PySpark code underneath. Good for visual learners and rapid prototyping.
Job Bookmarks
Track which data has already been processed to enable incremental loads. Prevents reprocessing of old data on subsequent job runs.
Dynamic Frames
Glue's own distributed data structure (extends Spark DataFrame). Handles schema inconsistencies and nested data more gracefully than raw DataFrames.
FindMatches Transform
ML-powered deduplication and record matching transform. No ML expertise required — Glue trains the model from labeled examples.
Sensitive Data Detection
Automatically detect PII and sensitive data patterns in datasets during ETL jobs.
Lake Formation Integration
Glue Data Catalog is the metadata backbone of AWS Lake Formation. Lake Formation adds fine-grained column and row-level security on top.
VPC / Private Network Support
Glue jobs can run inside a VPC to access JDBC sources in private subnets (RDS, Redshift). Requires a Glue Connection with VPC/subnet/security group config.
Flex Execution
Runs standard (G.1X/G.2X) Spark jobs on spare AWS capacity at lower cost. Best for non-urgent, time-flexible batch jobs. Not suitable for SLA-sensitive workloads, since job start may be delayed.
Auto Scaling for Glue Jobs
Glue can automatically scale the number of workers up and down during job execution based on workload. Reduces over-provisioning costs.
Built-in Visualization / Dashboards
Glue has NO visualization capability. Use Amazon QuickSight for dashboards on top of cataloged/processed data.
Real-time sub-second processing
Glue is a batch ETL service. Even Glue Streaming has seconds-to-minutes latency. Use Kinesis Data Analytics for Apache Flink (since renamed Amazon Managed Service for Apache Flink) for true real-time.
S3 Data Lake ETL Pipeline
High frequency: Glue Crawlers scan S3 buckets to catalog raw data. Glue ETL jobs transform and clean the data, writing processed output back to S3 in optimized formats (Parquet, ORC). Athena or Redshift Spectrum query the cataloged S3 data. This is the foundational AWS data lake pattern.
Serverless Query on Cataloged Data
High frequency: Glue Data Catalog serves as the shared metastore for Athena. Crawlers populate table definitions; Athena queries them directly via SQL. No data movement required — Athena reads directly from S3 using catalog metadata. Glue does NOT run Athena queries.
ETL to Data Warehouse
High frequency: Glue extracts data from operational sources (RDS, S3, DynamoDB), transforms it, and loads it into Amazon Redshift. Uses Glue's native Redshift connector with JDBC or the optimized Redshift Spark connector. Glue can also use Redshift Spectrum (via Data Catalog) for in-place querying.
Event-Driven ETL Trigger
High frequency: S3 event notifications trigger Lambda, which starts a Glue ETL job via the Glue API (StartJobRun). Used for near-real-time batch processing when new files land in S3. Lambda handles the trigger logic; Glue handles the heavy transformation. Lambda cannot replace Glue for large-scale data processing.
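A minimal sketch of the Lambda trigger side of this pattern. The job name and the `--INPUT_BUCKET`/`--INPUT_KEY` argument names are hypothetical placeholders; only `StartJobRun` and its `JobName`/`Arguments` fields come from the Glue API. The request is built in a pure function so it can be tested without AWS access.

```python
import json


def build_start_job_run_args(event, job_name="nightly-etl"):
    """Extract bucket/key from an S3 event notification and build the
    request for glue.start_job_run. Note: S3 URL-encodes object keys in
    event payloads; decode before use if keys can contain spaces."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {
        "JobName": job_name,
        "Arguments": {  # hypothetical job parameters, surfaced to the script
            "--INPUT_BUCKET": bucket,
            "--INPUT_KEY": key,
        },
    }


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    glue = boto3.client("glue")
    args = build_start_job_run_args(event)
    resp = glue.start_job_run(**args)  # StartJobRun API call
    return {"statusCode": 200, "body": json.dumps({"JobRunId": resp["JobRunId"]})}
```

Keeping the argument-building logic separate from the boto3 call makes the trigger easy to unit test, which matters because S3 event payloads are easy to mis-parse.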
ETL to BI Visualization
High frequency: Glue prepares and transforms raw data, stores results in S3 or Redshift, which QuickSight then visualizes. Glue has NO built-in visualization. QuickSight cannot perform complex ETL — they are complementary services. This is the canonical 'prepare then visualize' pattern.
Governed Data Lake with Fine-Grained Access
High frequency: Glue Data Catalog is the metadata backbone of Lake Formation. Lake Formation adds column-level and row-level security, data governance, and cross-account data sharing on top of the Glue catalog. Glue jobs register with Lake Formation to respect data access policies.
Glue Streaming ETL from Kinesis
Medium frequency: Glue Streaming jobs consume records from Kinesis Data Streams continuously using Spark Structured Streaming. Suitable for seconds-to-minutes latency micro-batch processing. NOT suitable for millisecond real-time — use Kinesis Data Analytics (Flink) for that.
Audit Logging of Glue API Activity
Medium frequency: CloudTrail records all Glue API calls (StartJobRun, CreateTable, UpdateCrawler, etc.) for security auditing and compliance. This is operational logging — NOT a compliance certification or formal audit report. Glue itself does not generate compliance reports.
Advanced ETL Workflow Orchestration
Medium frequency: Step Functions orchestrates Glue jobs alongside other AWS services (Lambda, ECS, SNS) for complex, conditional, or error-handling workflows. Preferred over Glue Workflows when cross-service orchestration, complex branching, or human approval steps are needed.
DynamoDB Export and Catalog
Medium frequency: Glue Crawlers can catalog DynamoDB tables. Glue ETL jobs can read from DynamoDB (via export to S3 or direct connector) to transform and load data into analytics stores. DynamoDB is OLTP; Glue bridges it to OLAP systems.
Security Findings — NOT Compliance Reports
Medium frequency: Security Hub aggregates security findings from Glue and other services. Important: Security Hub aggregation does NOT provide compliance certifications. It provides a security posture view. Candidates must not confuse security findings with formal compliance documentation.
AWS Glue is a BATCH ETL service, not a real-time streaming service. Even Glue Streaming Jobs (which exist) have seconds-to-minutes latency using Spark Structured Streaming. For true real-time/sub-second processing, the answer is Kinesis Data Analytics (Apache Flink), NOT Glue.
Glue has NO built-in visualization or dashboards. If an exam scenario asks about ETL + visualization, Glue handles the ETL and Amazon QuickSight handles visualization. Never select Glue as the answer for dashboards or BI reports.
The Glue Data Catalog is ONE per AWS account per region. It is the shared metastore for Athena, EMR, Redshift Spectrum, and Lake Formation. Cross-account catalog access requires a Glue Data Catalog resource policy or Lake Formation cross-account grants, not VPC peering or IAM user policies alone.
Job Bookmarks enable INCREMENTAL processing — Glue tracks which data has been processed and only processes new data on subsequent runs. Without bookmarks, every job run reprocesses all data. This is the key feature for cost optimization in recurring ETL jobs.
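Bookmarks are toggled per run with Glue's reserved `--job-bookmark-option` job parameter. A minimal sketch of building the `StartJobRun` request (the job name is a hypothetical placeholder; the parameter name and its values are Glue's documented special parameters):

```python
def start_job_run_args(job_name, enable_bookmarks=True):
    """Build arguments for glue.start_job_run. The reserved
    --job-bookmark-option parameter accepts 'job-bookmark-enable'
    (incremental processing), 'job-bookmark-disable' (reprocess all data),
    or 'job-bookmark-pause' (process new data without updating state)."""
    option = "job-bookmark-enable" if enable_bookmarks else "job-bookmark-disable"
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": option},
    }


# In a real caller:
#   boto3.client("glue").start_job_run(**start_job_run_args("nightly-etl"))
```

Inside the job script itself, the bookmark state is only persisted when the script calls `job.commit()`, so a script that omits the commit silently reprocesses everything on every run.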
Glue 2.0+ introduced 1-second billing granularity with a 1-minute minimum. Glue 1.0 had a 10-minute minimum billing window. For cost optimization questions, always prefer Glue 2.0+ (or the latest version) for short-running jobs.
Glue is BATCH ETL — NOT real-time. Glue Streaming still uses micro-batching (seconds latency). For sub-second real-time, the answer is ALWAYS Kinesis Data Analytics (Apache Flink), never Glue.
Glue has ZERO visualization capability. ETL pipeline questions requiring dashboards need BOTH Glue (transform) AND QuickSight (visualize). Never choose Glue as the answer for visualization requirements.
CloudTrail logs Glue API calls = operational audit logs. AWS Config rules = configuration compliance monitoring. Security Hub = security findings aggregation. NONE of these = formal compliance certifications. For compliance certs, use AWS Artifact.
Dynamic Frames vs. Spark DataFrames: Glue's DynamicFrame handles semi-structured data and schema inconsistencies (e.g., a column that is sometimes a string, sometimes an int). Convert to DataFrame for standard Spark operations, then back to DynamicFrame for Glue sinks.
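A sketch of that round trip as a Glue job script fragment. It requires the `awsglue` runtime (it is not runnable locally), and the database, table, column, and bucket names are hypothetical:

```python
# Glue job script fragment -- needs the awsglue runtime, not runnable locally
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read via the Data Catalog ('sales_db'/'orders' are placeholder names)
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Resolve a "choice" column that is sometimes string, sometimes int
dyf = dyf.resolveChoice(specs=[("order_id", "cast:long")])

# Convert to a Spark DataFrame for standard Spark operations...
df = dyf.toDF().filter("price > 0")

# ...then back to a DynamicFrame for Glue sinks
out = DynamicFrame.fromDF(df, glue_ctx, "out")
glue_ctx.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
)
```

The `resolveChoice` step is the part DataFrames cannot do natively: a plain Spark read would either fail or coerce the ambiguous column before you could inspect it.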
Glue Crawlers infer schemas and update the Data Catalog automatically. However, they do NOT query or transform data — they only read metadata. For adding partitions to an existing Athena table, you can use either a Crawler or MSCK REPAIR TABLE — but Crawlers also handle schema evolution.
For VPC-based data sources (RDS in private subnet, Redshift in VPC), Glue requires a Glue Connection configured with VPC, subnet, and security group. The security group must allow self-referencing inbound rules for Glue to function. This is a common architecture question.
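A minimal sketch of the `CreateConnection` request body for such a JDBC source. The field names (`ConnectionType`, `ConnectionProperties`, `PhysicalConnectionRequirements`) are from the Glue API; all identifiers passed in are hypothetical placeholders, and the request is built as a plain dict so it can be tested without AWS access.

```python
def build_jdbc_connection_input(name, jdbc_url, secret_arn, subnet_id, sg_id, az):
    """Build the ConnectionInput for glue.create_connection, targeting a
    JDBC source (e.g. RDS) in a private subnet. Credentials are referenced
    via Secrets Manager rather than embedded in the connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_arn,
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            # This security group must have a self-referencing inbound rule
            # allowing all traffic from itself, or Glue workers cannot talk
            # to each other inside the VPC.
            "SecurityGroupIdList": [sg_id],
            "AvailabilityZone": az,
        },
    }


# Usage: boto3.client("glue").create_connection(ConnectionInput=build_jdbc_connection_input(...))
```

The self-referencing security group rule is the detail exams probe: without it the connection test fails even though the JDBC endpoint itself is reachable.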
CloudTrail logs Glue API activity for auditing (who started a job, who modified the catalog). This is OPERATIONAL LOGGING — not a compliance certification, not a formal audit report, and not a substitute for AWS Artifact compliance documents.
Glue FindMatches is an ML transform for deduplication — it learns from labeled examples you provide. It does NOT require ML expertise. On exam questions about deduplication in ETL pipelines, FindMatches is the AWS-native answer.
Flex execution runs Glue Spark jobs (on standard G.1X/G.2X workers) on spare AWS capacity at a discount. Use it for non-urgent, time-insensitive batch jobs to reduce costs. Do NOT use Flex for SLA-critical or time-sensitive ETL pipelines, as job start may be delayed. (The quarter-DPU G.025X worker type is a separate option for low-volume streaming jobs, not a Flex tier.)
Python Shell jobs in Glue are for lightweight scripts (small datasets, API calls, simple transformations). They use 0.0625 DPU or 1 DPU — far cheaper than Spark jobs (minimum 2 DPUs). For cost-optimization questions involving simple scripts, Python Shell is the right answer.
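A sketch of the `CreateJob` request that defines a Python Shell job. The `Command.Name` value `"pythonshell"` (versus `"glueetl"` for Spark jobs) and the `MaxCapacity` constraint are from the Glue API; the job name, role ARN, and script path are hypothetical placeholders.

```python
def python_shell_job_input(name, script_path, role_arn, max_capacity=0.0625):
    """Build the request body for glue.create_job defining a Python Shell
    job. MaxCapacity must be 0.0625 or 1 DPU for this job type."""
    if max_capacity not in (0.0625, 1):
        raise ValueError("Python Shell jobs support only 0.0625 or 1 DPU")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",          # "glueetl" would mean a Spark job
            "ScriptLocation": script_path,  # e.g. an s3:// path to the script
            "PythonVersion": "3.9",
        },
        "MaxCapacity": max_capacity,
    }
```

The guard on `max_capacity` mirrors the service-side validation: requesting, say, 2 DPUs for a Python Shell job is rejected, because anything heavier belongs in a Spark job.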
AWS Glue DataBrew is a separate, no-code visual data preparation tool within the Glue family. It is NOT the same as Glue ETL jobs. DataBrew targets business analysts; Glue ETL targets data engineers. They have different pricing models.
Common Mistake
AWS Glue can process data in real-time with sub-second latency, making it suitable for real-time analytics pipelines
Correct
AWS Glue is fundamentally a BATCH ETL service. Even Glue Streaming Jobs (which continuously read from Kinesis or Kafka) use Spark Structured Streaming micro-batching with seconds-to-minutes latency — NOT sub-second real-time. For true real-time processing (<1 second), use Kinesis Data Analytics (Apache Flink).
This is the #1 Glue misconception on certification exams. Exam questions will describe a scenario requiring 'real-time' or 'sub-second' processing and list Glue as an option. Always eliminate Glue for true real-time requirements. The word 'streaming' in 'Glue Streaming' does not mean real-time — it means continuous micro-batch processing.
Common Mistake
AWS Glue includes built-in dashboards and visualization so you can see your transformed data immediately after ETL
Correct
AWS Glue has absolutely NO visualization or dashboard capability. It is purely a data integration and transformation service. After Glue processes data, you need a separate visualization tool — Amazon QuickSight for BI dashboards, or Athena for SQL queries on the results.
Exam questions about end-to-end analytics pipelines will test whether you know to separate ETL (Glue) from visualization (QuickSight). A common trap answer pairs Glue with a visualization requirement. Remember: Glue = Transform, QuickSight = Visualize — they are always used together, never interchangeably.
Common Mistake
CloudTrail logging of Glue API calls provides compliance certification and formal audit reports for regulatory requirements
Correct
CloudTrail records Glue API activity as OPERATIONAL LOGS — who called which API, when, from where. This is useful for security investigation and operational auditing, but it is NOT a compliance certification, NOT a formal audit report, and NOT equivalent to AWS Artifact compliance documents. Compliance certifications come from AWS Artifact (SOC reports, PCI DSS attestations, etc.).
Exam questions in the security/governance domain frequently present CloudTrail as a compliance solution. The correct answer is that CloudTrail provides audit trails (operational logs) while formal compliance certifications are obtained through AWS Artifact. Security Hub aggregates findings but also does not provide certifications.
Common Mistake
AWS Config monitoring of Glue resources equals formal compliance monitoring and generates compliance reports
Correct
AWS Config tracks configuration changes and compliance AGAINST RULES you define (e.g., 'Glue jobs must use encryption'). Config rules tell you if a resource is compliant with your internal policies — this is configuration compliance monitoring, NOT regulatory compliance certification. It does not generate SOC 2, HIPAA, or PCI DSS reports.
Candidates confuse 'Config compliance rules' with 'regulatory compliance certification.' Config is about enforcing your configuration standards. Regulatory compliance certifications require AWS Artifact. This distinction appears in exam questions about governance and compliance frameworks.
Common Mistake
Security Hub aggregating Glue security findings means your Glue environment is certified as compliant with security standards
Correct
Security Hub aggregates security FINDINGS from GuardDuty, Inspector, Macie, and other services — including Glue-related findings. It provides a unified security posture view and maps findings to frameworks like CIS, NIST, and PCI DSS. However, aggregating findings does NOT certify compliance. It identifies gaps; it does not issue certifications.
This misconception appears in exam questions that ask about achieving compliance certifications. Security Hub is a security findings aggregator and posture management tool — not a compliance certifier. The correct answer for formal compliance certifications is always AWS Artifact.
Common Mistake
Glue Crawlers transform and clean data as they discover it, making ETL jobs unnecessary for simple use cases
Correct
Glue Crawlers ONLY read metadata to infer schemas and populate the Data Catalog. They do NOT transform, clean, filter, or modify the underlying data in any way. ETL jobs are always required for actual data transformation. Crawlers are purely a metadata discovery and cataloging mechanism.
This misconception leads candidates to underestimate Crawlers (thinking they do too little) or overestimate them (thinking they replace ETL). Crawlers = catalog metadata. ETL jobs = transform data. These are completely separate functions that complement each other.
Common Mistake
Glue ETL jobs and AWS Database Migration Service (DMS) are interchangeable for moving data between databases
Correct
DMS is purpose-built for live database-to-database migration with minimal downtime, supporting ongoing replication (CDC - Change Data Capture). Glue ETL is for batch transformation of data — not live migration. DMS preserves transactional integrity during migration; Glue does not. For migrating a production RDS database with minimal downtime, use DMS. For transforming and loading historical data into a data warehouse, use Glue.
Exam questions about database migration scenarios will test this distinction. Key differentiator: DMS = live migration + CDC replication. Glue = batch ETL transformation. They can be used together (DMS for live migration, Glue for transforming historical data) but are not interchangeable.
GLUE = 'Grab, Label, Unify, Export' — Crawlers Grab data metadata, Data Catalog Labels and stores it, ETL jobs Unify/transform it, and jobs Export to target stores
Remember Glue's limitations with 'No VR': No Visualization, No Real-time (sub-second) — two things Glue absolutely cannot do
DPU math: 1 DPU = 4 vCPUs + 16 GB RAM. Think '4-16': 4 CPUs, 16 GB. Minimum 2 DPUs for Spark = 8 vCPUs + 32 GB minimum
Glue version billing: '1.0 = 10 minutes, 2.0+ = 1 minute' — upgrade versions to save money on short jobs
Catalog scope: 'One Catalog Per Region Per Account' — like one library per city branch, shared by all readers (Athena, EMR, Redshift Spectrum)
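Putting the DPU and billing mnemonics together, a cost estimate for a Glue 2.0+ Spark run can be sketched as below. The $0.44 per DPU-hour figure is an assumed list price for illustration (check current regional pricing); the per-second billing with a 1-minute minimum comes from the Glue 2.0+ billing model above.

```python
def glue_spark_job_cost(num_workers, dpu_per_worker, runtime_seconds,
                        price_per_dpu_hour=0.44, min_billed_seconds=60):
    """Estimate the cost of one Glue 2.0+ Spark job run: per-second billing
    with a 1-minute minimum. price_per_dpu_hour is an assumed list price."""
    billed = max(runtime_seconds, min_billed_seconds)  # 1-minute minimum
    dpu_hours = num_workers * dpu_per_worker * billed / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# 10 G.1X workers (1 DPU each) running 90 seconds:
#   10 * 1 * 90/3600 = 0.25 DPU-hours -> 0.25 * 0.44 = $0.11
```

The same math shows why the version matters for short jobs: under Glue 1.0's 10-minute minimum, that 90-second run would bill as 600 seconds, roughly 6.7x the cost.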
CertAI Tutor · DEA-C01, SAA-C03, SAP-C02, CLF-C02 · 2026-02-22