
Fully managed, serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics and ML
AWS Glue is a fully serverless data integration service that provides a unified platform for discovering, cataloging, cleaning, transforming, and moving data across data stores. It eliminates infrastructure management by automatically provisioning, configuring, and scaling the resources needed to run ETL jobs. Glue supports batch ETL workloads natively — it is NOT a real-time streaming service — and integrates deeply with the broader AWS analytics ecosystem including S3, Redshift, Athena, and EMR.
Automate and simplify the Extract, Transform, and Load (ETL) process for batch data pipelines without managing servers, making data ready for analytics and ML workloads
Glue Data Catalog
Centralized, persistent metadata repository. Acts as the Hive Metastore for Athena, EMR, and Redshift Spectrum. One catalog per account per region.
Glue Crawlers
Automatically scan data stores, infer schemas, and populate the Data Catalog. Support S3, JDBC, DynamoDB, DocumentDB, MongoDB, and more.
Glue ETL Jobs (Apache Spark)
Serverless Spark jobs. Auto-generates PySpark or Scala code. Supports G.1X, G.2X, G.4X, and G.8X worker types, plus the quarter-DPU G.025X type for low-volume streaming jobs.
Glue Python Shell Jobs
Lightweight Python scripts without Spark. Ideal for small datasets, API calls, or orchestration logic. Uses 0.0625 or 1 DPU.
Glue Streaming ETL
Continuous ETL from Kinesis Data Streams or Apache Kafka (MSK). Built on Spark Structured Streaming. NOT for sub-second latency use cases.
Glue DataBrew
Visual, no-code data preparation tool with 250+ built-in transformations. Separate billing from Glue ETL.
Glue Data Quality
Define data quality rules using DQDL (Data Quality Definition Language). Evaluate rules during ETL jobs or as standalone runs.
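As a sketch, a DQDL ruleset for a hypothetical orders table might look like the following (table and column names are illustrative, not from the source):

```
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "price" > 0
]
```

Each rule evaluates to pass/fail; a ruleset run can gate an ETL job or publish results as a standalone quality report.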
Glue Workflows
Orchestrate complex ETL pipelines with multiple jobs and crawlers. Triggered by schedules, events, or on-demand. Alternative: AWS Step Functions for more complex orchestration.
Glue Triggers
Schedule-based, on-demand, or conditional (job completion). Used within Glue Workflows to chain jobs.
Glue Studio
Visual drag-and-drop ETL job builder. Generates PySpark code underneath. Good for visual learners and rapid prototyping.
Job Bookmarks
Track which data has already been processed to enable incremental loads. Prevents reprocessing of old data on subsequent job runs.
Dynamic Frames
Glue's own distributed data structure (extends Spark DataFrame). Handles schema inconsistencies and nested data more gracefully than raw DataFrames.
FindMatches Transform
ML-powered deduplication and record matching transform. No ML expertise required — Glue trains the model from labeled examples.
Sensitive Data Detection
Automatically detect PII and sensitive data patterns in datasets during ETL jobs.
Lake Formation Integration
Glue Data Catalog is the metadata backbone of AWS Lake Formation. Lake Formation adds fine-grained column and row-level security on top.
VPC / Private Network Support
Glue jobs can run inside a VPC to access JDBC sources in private subnets (RDS, Redshift). Requires a Glue Connection with VPC/subnet/security group config.
Flex Execution
Runs standard (G.1X/G.2X) Spark jobs on spare AWS capacity at lower cost. Best for non-urgent, time-flexible batch jobs. Not suitable for SLA-sensitive workloads, since job start may be delayed.
Auto Scaling for Glue Jobs
Glue can automatically scale the number of workers up and down during job execution based on workload. Reduces over-provisioning costs.
Built-in Visualization / Dashboards
Glue has NO visualization capability. Use Amazon QuickSight for dashboards on top of cataloged/processed data.
Real-time sub-second processing
Glue is a batch ETL service. Even Glue Streaming has seconds-to-minutes latency. Use Kinesis Data Analytics for Apache Flink (since renamed Amazon Managed Service for Apache Flink) for true real-time.
S3 Data Lake ETL Pipeline
High frequency: Glue Crawlers scan S3 buckets to catalog raw data. Glue ETL jobs transform and clean the data, writing processed output back to S3 in optimized formats (Parquet, ORC). Athena or Redshift Spectrum query the cataloged S3 data. This is the foundational AWS data lake pattern.
Serverless Query on Cataloged Data
High frequency: Glue Data Catalog serves as the shared metastore for Athena. Crawlers populate table definitions; Athena queries them directly via SQL. No data movement required — Athena reads directly from S3 using catalog metadata. Glue does NOT run Athena queries.
ETL to Data Warehouse
High frequency: Glue extracts data from operational sources (RDS, S3, DynamoDB), transforms it, and loads it into Amazon Redshift. Uses Glue's native Redshift connector with JDBC or the optimized Redshift Spark connector. Glue can also use Redshift Spectrum (via Data Catalog) for in-place querying.
Event-Driven ETL Trigger
High frequency: S3 event notifications trigger Lambda, which starts a Glue ETL job via the Glue API (StartJobRun). Used for near-real-time batch processing when new files land in S3. Lambda handles the trigger logic; Glue handles the heavy transformation. Lambda cannot replace Glue for large-scale data processing.
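A minimal sketch of the Lambda trigger side of this pattern. The job name and the `--INPUT_BUCKET`/`--INPUT_KEY` argument names are hypothetical placeholders; only `StartJobRun` and its `JobName`/`Arguments` fields come from the Glue API. The request is built in a pure function so it can be tested without AWS access.

```python
import json


def build_start_job_run_args(event, job_name="nightly-etl"):
    """Extract bucket/key from an S3 event notification and build the
    request for glue.start_job_run. Note: S3 URL-encodes object keys in
    event payloads; decode before use if keys can contain spaces."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {
        "JobName": job_name,
        "Arguments": {  # hypothetical job parameters, surfaced to the script
            "--INPUT_BUCKET": bucket,
            "--INPUT_KEY": key,
        },
    }


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    glue = boto3.client("glue")
    args = build_start_job_run_args(event)
    resp = glue.start_job_run(**args)  # StartJobRun API call
    return {"statusCode": 200, "body": json.dumps({"JobRunId": resp["JobRunId"]})}
```

Keeping the argument-building logic separate from the boto3 call makes the trigger easy to unit test, which matters because S3 event payloads are easy to mis-parse.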
ETL to BI Visualization
High frequency: Glue prepares and transforms raw data, stores results in S3 or Redshift, which QuickSight then visualizes. Glue has NO built-in visualization. QuickSight cannot perform complex ETL — they are complementary services. This is the canonical 'prepare then visualize' pattern.
Governed Data Lake with Fine-Grained Access
High frequency: Glue Data Catalog is the metadata backbone of Lake Formation. Lake Formation adds column-level and row-level security, data governance, and cross-account data sharing on top of the Glue catalog. Glue jobs register with Lake Formation to respect data access policies.
Glue Streaming ETL from Kinesis
Medium frequency: Glue Streaming jobs consume records from Kinesis Data Streams continuously using Spark Structured Streaming. Suitable for seconds-to-minutes latency micro-batch processing. NOT suitable for millisecond real-time — use Kinesis Data Analytics (Flink) for that.
Audit Logging of Glue API Activity
Medium frequency: CloudTrail records all Glue API calls (StartJobRun, CreateTable, UpdateCrawler, etc.) for security auditing and compliance. This is operational logging — NOT a compliance certification or formal audit report. Glue itself does not generate compliance reports.
Advanced ETL Workflow Orchestration
Medium frequency: Step Functions orchestrates Glue jobs alongside other AWS services (Lambda, ECS, SNS) for complex, conditional, or error-handling workflows. Preferred over Glue Workflows when cross-service orchestration, complex branching, or human approval steps are needed.
DynamoDB Export and Catalog
Medium frequency: Glue Crawlers can catalog DynamoDB tables. Glue ETL jobs can read from DynamoDB (via export to S3 or direct connector) to transform and load data into analytics stores. DynamoDB is OLTP; Glue bridges it to OLAP systems.
Security Findings — NOT Compliance Reports
Medium frequency: Security Hub aggregates security findings from Glue and other services. Important: Security Hub aggregation does NOT provide compliance certifications. It provides a security posture view. Candidates must not confuse security findings with formal compliance documentation.
AWS Glue is a BATCH ETL service, not a real-time streaming service. Even Glue Streaming Jobs (which exist) have seconds-to-minutes latency using Spark Structured Streaming. For true real-time/sub-second processing, the answer is Kinesis Data Analytics (Apache Flink), NOT Glue.
Glue has NO built-in visualization or dashboards. If an exam scenario asks about ETL + visualization, Glue handles the ETL and Amazon QuickSight handles visualization. Never select Glue as the answer for dashboards or BI reports.
The Glue Data Catalog is ONE per AWS account per region. It is the shared metastore for Athena, EMR, Redshift Spectrum, and Lake Formation. Cross-account catalog access requires a Glue Data Catalog resource policy or Lake Formation cross-account grants, not VPC peering or IAM user policies alone.
Job Bookmarks enable INCREMENTAL processing — Glue tracks which data has been processed and only processes new data on subsequent runs. Without bookmarks, every job run reprocesses all data. This is the key feature for cost optimization in recurring ETL jobs.
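Bookmarks are toggled per run with Glue's reserved `--job-bookmark-option` job parameter. A minimal sketch of building the `StartJobRun` request (the job name is a hypothetical placeholder; the parameter name and its values are Glue's documented special parameters):

```python
def start_job_run_args(job_name, enable_bookmarks=True):
    """Build arguments for glue.start_job_run. The reserved
    --job-bookmark-option parameter accepts 'job-bookmark-enable'
    (incremental processing), 'job-bookmark-disable' (reprocess all data),
    or 'job-bookmark-pause' (process new data without updating state)."""
    option = "job-bookmark-enable" if enable_bookmarks else "job-bookmark-disable"
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": option},
    }


# In a real caller:
#   boto3.client("glue").start_job_run(**start_job_run_args("nightly-etl"))
```

Inside the job script itself, the bookmark state is only persisted when the script calls `job.commit()`, so a script that omits the commit silently reprocesses everything on every run.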
Glue 2.0+ introduced 1-second billing granularity with a 1-minute minimum. Glue 1.0 had a 10-minute minimum billing window. For cost optimization questions, always prefer Glue 2.0+ (or the latest version) for short-running jobs.
Glue is BATCH ETL — NOT real-time. Glue Streaming still uses micro-batching (seconds latency). For sub-second real-time, the answer is ALWAYS Kinesis Data Analytics (Apache Flink), never Glue.
Glue has ZERO visualization capability. ETL pipeline questions requiring dashboards need BOTH Glue (transform) AND QuickSight (visualize). Never choose Glue as the answer for visualization requirements.
CloudTrail logs Glue API calls = operational audit logs. AWS Config rules = configuration compliance monitoring. Security Hub = security findings aggregation. NONE of these = formal compliance certifications. For compliance certs, use AWS Artifact.
Dynamic Frames vs. Spark DataFrames: Glue's DynamicFrame handles semi-structured data and schema inconsistencies (e.g., a column that is sometimes a string, sometimes an int). Convert to DataFrame for standard Spark operations, then back to DynamicFrame for Glue sinks.
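A sketch of that round trip as a Glue job script fragment. It requires the `awsglue` runtime (it is not runnable locally), and the database, table, column, and bucket names are hypothetical:

```python
# Glue job script fragment -- needs the awsglue runtime, not runnable locally
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read via the Data Catalog ('sales_db'/'orders' are placeholder names)
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Resolve a "choice" column that is sometimes string, sometimes int
dyf = dyf.resolveChoice(specs=[("order_id", "cast:long")])

# Convert to a Spark DataFrame for standard Spark operations...
df = dyf.toDF().filter("price > 0")

# ...then back to a DynamicFrame for Glue sinks
out = DynamicFrame.fromDF(df, glue_ctx, "out")
glue_ctx.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
)
```

The `resolveChoice` step is the part DataFrames cannot do natively: a plain Spark read would either fail or coerce the ambiguous column before you could inspect it.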
Glue Crawlers infer schemas and update the Data Catalog automatically. However, they do NOT query or transform data — they only read metadata. For adding partitions to an existing Athena table, you can use either a Crawler or MSCK REPAIR TABLE — but Crawlers also handle schema evolution.
For VPC-based data sources (RDS in private subnet, Redshift in VPC), Glue requires a Glue Connection configured with VPC, subnet, and security group. The security group must allow self-referencing inbound rules for Glue to function. This is a common architecture question.
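A minimal sketch of the `CreateConnection` request body for such a JDBC source. The field names (`ConnectionType`, `ConnectionProperties`, `PhysicalConnectionRequirements`) are from the Glue API; all identifiers passed in are hypothetical placeholders, and the request is built as a plain dict so it can be tested without AWS access.

```python
def build_jdbc_connection_input(name, jdbc_url, secret_arn, subnet_id, sg_id, az):
    """Build the ConnectionInput for glue.create_connection, targeting a
    JDBC source (e.g. RDS) in a private subnet. Credentials are referenced
    via Secrets Manager rather than embedded in the connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_arn,
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            # This security group must have a self-referencing inbound rule
            # allowing all traffic from itself, or Glue workers cannot talk
            # to each other inside the VPC.
            "SecurityGroupIdList": [sg_id],
            "AvailabilityZone": az,
        },
    }


# Usage: boto3.client("glue").create_connection(ConnectionInput=build_jdbc_connection_input(...))
```

The self-referencing security group rule is the detail exams probe: without it the connection test fails even though the JDBC endpoint itself is reachable.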
CloudTrail logs Glue API activity for auditing (who started a job, who modified the catalog). This is OPERATIONAL LOGGING — not a compliance certification, not a formal audit report, and not a substitute for AWS Artifact compliance documents.
Glue FindMatches is an ML transform for deduplication — it learns from labeled examples you provide. It does NOT require ML expertise. On exam questions about deduplication in ETL pipelines, FindMatches is the AWS-native answer.
Flex execution runs Glue Spark jobs (on standard G.1X/G.2X workers) on spare AWS capacity at a discount. Use it for non-urgent, time-insensitive batch jobs to reduce costs. Do NOT use Flex for SLA-critical or time-sensitive ETL pipelines, as job start may be delayed. (The quarter-DPU G.025X worker type is a separate option for low-volume streaming jobs, not a Flex tier.)
Python Shell jobs in Glue are for lightweight scripts (small datasets, API calls, simple transformations). They use 0.0625 DPU or 1 DPU — far cheaper than Spark jobs (minimum 2 DPUs). For cost-optimization questions involving simple scripts, Python Shell is the right answer.
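A sketch of the `CreateJob` request that defines a Python Shell job. The `Command.Name` value `"pythonshell"` (versus `"glueetl"` for Spark jobs) and the `MaxCapacity` constraint are from the Glue API; the job name, role ARN, and script path are hypothetical placeholders.

```python
def python_shell_job_input(name, script_path, role_arn, max_capacity=0.0625):
    """Build the request body for glue.create_job defining a Python Shell
    job. MaxCapacity must be 0.0625 or 1 DPU for this job type."""
    if max_capacity not in (0.0625, 1):
        raise ValueError("Python Shell jobs support only 0.0625 or 1 DPU")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",          # "glueetl" would mean a Spark job
            "ScriptLocation": script_path,  # e.g. an s3:// path to the script
            "PythonVersion": "3.9",
        },
        "MaxCapacity": max_capacity,
    }
```

The guard on `max_capacity` mirrors the service-side validation: requesting, say, 2 DPUs for a Python Shell job is rejected, because anything heavier belongs in a Spark job.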
AWS Glue DataBrew is a separate, no-code visual data preparation tool within the Glue family. It is NOT the same as Glue ETL jobs. DataBrew targets business analysts; Glue ETL targets data engineers. They have different pricing models.
Common Mistake
AWS Glue can process data in real-time with sub-second latency, making it suitable for real-time analytics pipelines
Correct
AWS Glue is fundamentally a BATCH ETL service. Even Glue Streaming Jobs (which continuously read from Kinesis or Kafka) use Spark Structured Streaming micro-batching with seconds-to-minutes latency — NOT sub-second real-time. For true real-time processing (<1 second), use Kinesis Data Analytics (Apache Flink).
This is the #1 Glue misconception on certification exams. Exam questions will describe a scenario requiring 'real-time' or 'sub-second' processing and list Glue as an option. Always eliminate Glue for true real-time requirements. The word 'streaming' in 'Glue Streaming' does not mean real-time — it means continuous micro-batch processing.
Common Mistake
AWS Glue includes built-in dashboards and visualization so you can see your transformed data immediately after ETL
Correct
AWS Glue has absolutely NO visualization or dashboard capability. It is purely a data integration and transformation service. After Glue processes data, you need a separate visualization tool — Amazon QuickSight for BI dashboards, or Athena for SQL queries on the results.
Exam questions about end-to-end analytics pipelines will test whether you know to separate ETL (Glue) from visualization (QuickSight). A common trap answer pairs Glue with a visualization requirement. Remember: Glue = Transform, QuickSight = Visualize — they are always used together, never interchangeably.
Common Mistake
CloudTrail logging of Glue API calls provides compliance certification and formal audit reports for regulatory requirements
Correct
CloudTrail records Glue API activity as OPERATIONAL LOGS — who called which API, when, from where. This is useful for security investigation and operational auditing, but it is NOT a compliance certification, NOT a formal audit report, and NOT equivalent to AWS Artifact compliance documents. Compliance certifications come from AWS Artifact (SOC reports, PCI DSS attestations, etc.).
Exam questions in the security/governance domain frequently present CloudTrail as a compliance solution. The correct answer is that CloudTrail provides audit trails (operational logs) while formal compliance certifications are obtained through AWS Artifact. Security Hub aggregates findings but also does not provide certifications.
Common Mistake
AWS Config monitoring of Glue resources equals formal compliance monitoring and generates compliance reports
Correct
AWS Config tracks configuration changes and compliance AGAINST RULES you define (e.g., 'Glue jobs must use encryption'). Config rules tell you if a resource is compliant with your internal policies — this is configuration compliance monitoring, NOT regulatory compliance certification. It does not generate SOC 2, HIPAA, or PCI DSS reports.
Candidates confuse 'Config compliance rules' with 'regulatory compliance certification.' Config is about enforcing your configuration standards. Regulatory compliance certifications require AWS Artifact. This distinction appears in exam questions about governance and compliance frameworks.
Common Mistake
Security Hub aggregating Glue security findings means your Glue environment is certified as compliant with security standards
Correct
Security Hub aggregates security FINDINGS from GuardDuty, Inspector, Macie, and other services — including Glue-related findings. It provides a unified security posture view and maps findings to frameworks like CIS, NIST, and PCI DSS. However, aggregating findings does NOT certify compliance. It identifies gaps; it does not issue certifications.
This misconception appears in exam questions that ask about achieving compliance certifications. Security Hub is a security findings aggregator and posture management tool — not a compliance certifier. The correct answer for formal compliance certifications is always AWS Artifact.
Common Mistake
Glue Crawlers transform and clean data as they discover it, making ETL jobs unnecessary for simple use cases
Correct
Glue Crawlers ONLY read metadata to infer schemas and populate the Data Catalog. They do NOT transform, clean, filter, or modify the underlying data in any way. ETL jobs are always required for actual data transformation. Crawlers are purely a metadata discovery and cataloging mechanism.
This misconception leads candidates to underestimate Crawlers (thinking they do too little) or overestimate them (thinking they replace ETL). Crawlers = catalog metadata. ETL jobs = transform data. These are completely separate functions that complement each other.
Common Mistake
Glue ETL jobs and AWS Database Migration Service (DMS) are interchangeable for moving data between databases
Correct
DMS is purpose-built for live database-to-database migration with minimal downtime, supporting ongoing replication (CDC - Change Data Capture). Glue ETL is for batch transformation of data — not live migration. DMS preserves transactional integrity during migration; Glue does not. For migrating a production RDS database with minimal downtime, use DMS. For transforming and loading historical data into a data warehouse, use Glue.
Exam questions about database migration scenarios will test this distinction. Key differentiator: DMS = live migration + CDC replication. Glue = batch ETL transformation. They can be used together (DMS for live migration, Glue for transforming historical data) but are not interchangeable.
GLUE = 'Grab, Label, Unify, Export' — Crawlers Grab data metadata, Data Catalog Labels and stores it, ETL jobs Unify/transform it, and jobs Export to target stores
Remember Glue's limitations with 'No VR': No Visualization, No Real-time (sub-second) — two things Glue absolutely cannot do
DPU math: 1 DPU = 4 vCPUs + 16 GB RAM. Think '4-16': 4 CPUs, 16 GB. Minimum 2 DPUs for Spark = 8 vCPUs + 32 GB minimum
Glue version billing: '1.0 = 10 minutes, 2.0+ = 1 minute' — upgrade versions to save money on short jobs
Catalog scope: 'One Catalog Per Region Per Account' — like one library per city branch, shared by all readers (Athena, EMR, Redshift Spectrum)
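Putting the DPU and billing mnemonics together, a cost estimate for a Glue 2.0+ Spark run can be sketched as below. The $0.44 per DPU-hour figure is an assumed list price for illustration (check current regional pricing); the per-second billing with a 1-minute minimum comes from the Glue 2.0+ billing model above.

```python
def glue_spark_job_cost(num_workers, dpu_per_worker, runtime_seconds,
                        price_per_dpu_hour=0.44, min_billed_seconds=60):
    """Estimate the cost of one Glue 2.0+ Spark job run: per-second billing
    with a 1-minute minimum. price_per_dpu_hour is an assumed list price."""
    billed = max(runtime_seconds, min_billed_seconds)  # 1-minute minimum
    dpu_hours = num_workers * dpu_per_worker * billed / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# 10 G.1X workers (1 DPU each) running 90 seconds:
#   10 * 1 * 90/3600 = 0.25 DPU-hours -> 0.25 * 0.44 = $0.11
```

The same math shows why the version matters for short jobs: under Glue 1.0's 10-minute minimum, that 90-second run would bill as 600 seconds, roughly 6.7x the cost.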
CertAI Tutor · DEA-C01, SAA-C03, SAP-C02, CLF-C02 · 2026-02-22