
Master the end-to-end ML lifecycle on AWS — the services, patterns, and decisions that separate passing candidates from failing ones.
An ML pipeline on AWS is the orchestrated sequence of steps — data ingestion, preprocessing, feature engineering, model training, evaluation, and deployment — that transforms raw data into a production-ready, continuously monitored machine learning model. AWS provides a layered ecosystem of services (SageMaker, Glue, Step Functions, Kinesis, and more) that can be composed into fully automated pipelines. For certification exams, understanding WHICH service handles WHICH stage, and WHY, is the single most tested concept in the ML/AI domain.
Exam questions present a business scenario with specific constraints (latency, cost, scale, real-time vs. batch) and ask you to select the correct AWS service or architecture pattern for a given ML pipeline stage. Knowing the purpose, boundaries, and integration points of each service is essential to answering these correctly.
Fully Managed SageMaker Pipeline (End-to-End)
Use Amazon SageMaker Pipelines to define, automate, and version the entire ML workflow — data processing, training, evaluation, model registration, and conditional deployment — as a DAG (Directed Acyclic Graph) of steps. SageMaker Model Registry tracks model versions and approval status. SageMaker Experiments tracks runs and metrics. CI/CD is achieved via CodePipeline triggering SageMaker Pipelines.
When the team wants a single, unified ML platform with minimal operational overhead. Ideal for teams already using SageMaker for training and hosting. Best for iterative experimentation + production promotion workflows.
Higher SageMaker-specific lock-in. Costs scale with managed infrastructure. Less flexibility for custom orchestration logic compared to Step Functions.
Step Functions + Glue + SageMaker (Decoupled Orchestration)
AWS Step Functions orchestrates the pipeline as a state machine, calling AWS Glue jobs for ETL/data preparation, Amazon SageMaker for training and inference, and Lambda for lightweight transformations or routing logic. Each stage is a discrete, independently scalable service. EventBridge can trigger the pipeline on schedule or on S3 data arrival.
When you need fine-grained control over error handling, retries, branching logic, and timeouts. Ideal when the data engineering team (Glue) and ML team (SageMaker) are separate. Best for complex workflows that mix ML with non-ML steps (e.g., sending SNS notifications, updating DynamoDB).
More components to manage. State machine definitions can become complex. Debugging requires CloudWatch Logs across multiple services.
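The state-machine shape described above can be sketched in Amazon States Language (expressed here as a Python dict). The job name, ARN wiring, and retry values are illustrative placeholders, and the training step elides the full parameter set a real `createTrainingJob` call requires:

```python
import json

# Sketch: Step Functions state machine chaining a Glue ETL job into a
# SageMaker training job. Names and values are placeholders.
state_machine = {
    "Comment": "Glue ETL -> SageMaker training, with retry on the ETL step",
    "StartAt": "PrepareData",
    "States": {
        "PrepareData": {
            "Type": "Task",
            # .sync integration: Step Functions waits for the Glue job to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-etl-job"},
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 30, "MaxAttempts": 2}],
            "Next": "TrainModel",
        },
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            # Real training jobs need AlgorithmSpecification, RoleArn, etc.
            # (elided); here we only show naming the job from the execution.
            "Parameters": {"TrainingJobName.$": "$$.Execution.Name"},
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

The `.sync` resource suffix is what gives Step Functions its value here: the state machine blocks until the Glue or SageMaker job completes, so error handling and retries live in one place.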
Real-Time Streaming ML Pipeline (Kinesis + Lambda + SageMaker Endpoint)
Kinesis Data Streams or Kinesis Data Firehose ingests real-time event data. Lambda (or Kinesis Data Analytics / Apache Flink) performs feature extraction on the stream. A SageMaker real-time inference endpoint serves predictions with millisecond latency. Results are written to DynamoDB, S3, or pushed back to Kinesis for downstream consumers.
When predictions must be made on live, streaming data — fraud detection, real-time recommendations, anomaly detection on IoT telemetry. Latency requirements are sub-second to single-digit seconds.
Higher cost due to always-on SageMaker endpoint. Requires careful feature engineering to work with partial/streaming data. Monitoring data drift on streaming data is more complex.
Batch Inference Pipeline (S3 + Glue + SageMaker Batch Transform)
Data lands in S3 (from Glue ETL, Athena, or direct upload). SageMaker Batch Transform runs inference on the entire dataset without a persistent endpoint — the model container spins up, processes all records, writes predictions back to S3, and shuts down. Results are consumed downstream by Athena, QuickSight, or application databases.
When predictions are needed on large, static datasets and real-time latency is not required. Examples: nightly churn scoring, weekly product recommendations, monthly risk assessments. Cost-optimized — no idle endpoint costs.
Not suitable for real-time use cases. Results are only as fresh as the last batch run. Startup time for the transform job adds latency.
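As a sketch of what "no persistent endpoint" looks like in practice, here is the request dict a Batch Transform job takes via boto3's `create_transform_job` (shown as a dict only, no AWS call; the job name, model name, and S3 URIs are hypothetical):

```python
# Sketch: parameters for a nightly batch scoring job. Placeholders only.
transform_request = {
    "TransformJobName": "churn-scoring-nightly",      # hypothetical name
    "ModelName": "churn-model-v3",                    # an already-registered model
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/batch-input/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",                          # one record per line
    },
    "TransformOutput": {
        "S3OutputPath": "s3://example-bucket/batch-output/",
        "AssembleWith": "Line",
    },
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}

# In a real pipeline:
#   boto3.client("sagemaker").create_transform_job(**transform_request)
# The job spins up, scores every object under the input prefix, writes
# predictions to the output path, and terminates -- no idle endpoint cost.
```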
Feature Store-Centric Pipeline (SageMaker Feature Store)
A centralized Feature Store (SageMaker Feature Store) acts as the single source of truth for features used in both training and inference. The offline store (S3-backed) is used for training dataset creation. The online store (low-latency, managed) is used for real-time inference feature lookup. This eliminates training-serving skew — the #1 source of ML production failures.
When multiple teams share features across multiple models. When consistency between training features and serving features is critical. When features are expensive to compute and should be reused.
Adds architectural complexity. Feature pipelines must be maintained to keep the store fresh. Online store has costs proportional to read/write throughput.
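The skew-elimination argument can be shown in plain Python, with no AWS calls: the point of a feature store is that the offline training path and the online serving path resolve to one canonical feature definition (the feature logic below is invented for illustration):

```python
# Conceptual illustration: one feature definition, two consumers.

def compute_features(record: dict) -> dict:
    """Canonical feature logic, defined once (in a feature store this
    would be registered once and materialized to both stores)."""
    return {
        "amount_log_bucket": min(int(record["amount"]).bit_length(), 20),
        "is_weekend": record["day_of_week"] in ("Sat", "Sun"),
    }

raw = {"amount": 250, "day_of_week": "Sat"}

training_row = compute_features(raw)   # offline path: training set creation
serving_row = compute_features(raw)    # online path: real-time lookup

# Identical by construction -- skew is impossible when both paths share
# the definition; it becomes possible the moment they each reimplement it.
assert training_row == serving_row
print(training_row)
```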
MLOps with CI/CD and Model Monitoring (SageMaker + CodePipeline + Model Monitor)
CodeCommit/GitHub triggers CodePipeline on code/data changes. CodeBuild runs unit tests and packages training code. SageMaker Pipelines retrains the model. SageMaker Model Registry gates promotion to production with manual or automated approval. SageMaker Model Monitor continuously checks for data drift, model quality drift, bias drift, and feature attribution drift on the live endpoint. CloudWatch Alarms trigger retraining when drift thresholds are breached.
Production ML systems where model performance degrades over time (concept drift, data drift). Regulated industries requiring audit trails of model versions and approvals. Teams practicing MLOps maturity level 2+ (automated retraining pipelines).
Significant upfront setup investment. Requires DevOps + ML expertise. Monitoring baselines must be established from training data and kept current.
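The "CloudWatch Alarms trigger retraining" step can be sketched as a `put_metric_alarm` request dict (no AWS call made here). The namespace and metric name are assumptions based on how Model Monitor publishes endpoint metrics, and the SNS topic ARN is a placeholder; verify the exact metric names your monitoring schedule emits before wiring this up:

```python
# Sketch: an alarm on a drift metric that notifies a retraining trigger.
# Namespace, metric name, and topic ARN below are illustrative only.
alarm_request = {
    "AlarmName": "churn-model-data-drift",
    "Namespace": "aws/sagemaker/Endpoints/data-metrics",   # assumed namespace
    "MetricName": "feature_baseline_drift_total",          # placeholder metric
    "Statistic": "Average",
    "Period": 3600,                    # evaluate hourly
    "EvaluationPeriods": 1,
    "Threshold": 0.2,
    "ComparisonOperator": "GreaterThanThreshold",
    # Hypothetical SNS topic wired to start the retraining pipeline
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:retrain-topic"],
}

# In a real deployment:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```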
STEP 1 — What is the prediction latency requirement?
→ Sub-second real-time: SageMaker Real-Time Endpoint (+ Kinesis for streaming ingestion)
→ Async, minutes acceptable: SageMaker Asynchronous Inference
→ Batch, hours acceptable: SageMaker Batch Transform (no persistent endpoint)
STEP 2 — What is the data ingestion pattern?
→ Streaming events: Kinesis Data Streams → Lambda/Kinesis Data Analytics → feature extraction → SageMaker Endpoint
→ Batch files in S3: EventBridge (schedule/S3 event) → Glue ETL → SageMaker Training/Batch Transform
→ Database records: DMS or Glue JDBC → S3 → SageMaker
STEP 3 — Who orchestrates the pipeline?
→ ML-native, simple DAG: SageMaker Pipelines
→ Complex branching, multi-service, non-ML steps: AWS Step Functions
→ Event-driven, loosely coupled: EventBridge + Lambda + SageMaker
STEP 4 — Is feature consistency (training vs. serving) a requirement?
→ Yes: SageMaker Feature Store (offline for training, online for serving)
→ No: Direct S3 feature files for training; compute features at inference time
STEP 5 — Is model governance/auditability required?
→ Yes: SageMaker Model Registry (versioning + approval workflow) + SageMaker Experiments
→ No: Direct S3 model artifact storage
STEP 6 — Is ongoing monitoring required?
→ Yes: SageMaker Model Monitor (data quality, model quality, bias, feature attribution)
→ No: CloudWatch metrics on the endpoint (basic)
STEP 7 — Cost optimization?
→ Sporadic workloads: Batch Transform (no idle cost) or Serverless Inference
→ Steady-state traffic: Real-Time Endpoint with Auto Scaling
→ Spiky/unpredictable: SageMaker Serverless Inference or Asynchronous Inference
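The latency and cost branches of the decision tree (STEPs 1 and 7) can be encoded as a small lookup function, useful for self-testing. This is a study aid that mirrors the guide's wording, not an AWS API:

```python
def choose_inference_mode(latency: str, traffic: str = "steady") -> str:
    """Map a scenario's latency/traffic profile to a SageMaker
    inference mode, following the decision tree above."""
    if latency == "sub-second":
        return "Real-Time Endpoint"       # persistent, millisecond latency
    if latency == "minutes":
        return "Asynchronous Inference"   # queued, large payloads
    if latency == "hours":
        return "Batch Transform"          # no persistent endpoint, lowest cost
    if traffic in ("spiky", "intermittent"):
        return "Serverless Inference"     # scales to zero, cold starts
    raise ValueError(f"unmapped scenario: latency={latency!r}, traffic={traffic!r}")

# Examples
print(choose_inference_mode("sub-second"))                  # Real-Time Endpoint
print(choose_inference_mode("hours"))                       # Batch Transform
print(choose_inference_mode("flexible", traffic="spiky"))   # Serverless Inference
```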
SageMaker Batch Transform does NOT require a persistent endpoint — it spins up compute, runs inference on all records in S3, writes results back to S3, and terminates. This is the cost-optimized answer for any scenario where real-time latency is not required.
Training-serving skew (using different feature logic at training time vs. inference time) is solved by SageMaker Feature Store. If an exam scenario describes inconsistent model predictions in production despite good training metrics, Feature Store is the architectural answer.
SageMaker Pipelines is the AWS-native MLOps orchestrator for ML workflows. Step Functions is the general-purpose orchestrator. If the question asks specifically about orchestrating ML steps with built-in experiment tracking and model registry integration, SageMaker Pipelines wins. If the workflow mixes ML with non-ML AWS services heavily, Step Functions wins.
SageMaker Model Monitor has FOUR monitor types: (1) Data Quality — detects data drift vs. training baseline, (2) Model Quality — detects prediction quality degradation, (3) Bias Drift — detects fairness drift using Clarify, (4) Feature Attribution Drift — detects SHAP value shifts. Exam questions will name a specific drift type and ask which monitor to configure.
Amazon Rekognition, Comprehend, Textract, Transcribe, Polly, Translate, and Forecast are fully managed AI services — they require NO ML expertise, NO model training, and NO infrastructure management. If an exam scenario says 'the team has no ML expertise,' these pre-trained services are always preferred over SageMaker custom training.
Match the inference mode to latency: Real-time endpoint = milliseconds (persistent, always-on); Asynchronous = minutes, large payloads up to 1 GB; Batch Transform = hours, no persistent endpoint, lowest cost; Serverless = variable, cold starts, zero idle cost. Getting this mapping wrong is the #1 source of wrong answers on ML architecture questions.
If the scenario says 'no ML expertise required' or describes a standard vision/NLP/speech task, the answer is ALWAYS a pre-built AWS AI service (Rekognition, Comprehend, Textract, Transcribe, etc.) — NOT SageMaker custom training. Layer selection is tested constantly.
SageMaker Feature Store solves training-serving skew. If a question describes a model that performed well in training/testing but poorly in production with no obvious bug, the root cause is feature inconsistency and the solution is Feature Store.
AWS Glue is the go-to ETL service for ML data preparation at scale. Glue Data Catalog acts as the metadata repository. Glue DataBrew is the no-code/low-code data preparation tool for analysts. If a question mentions 'data scientists who are not programmers need to clean data,' DataBrew is the answer, not Glue ETL jobs.
Amazon SageMaker Asynchronous Inference is designed for large payloads (up to 1 GB) and long processing times (up to 15 minutes). It queues requests and returns results to S3. Use this when the input is a large document, video, or audio file that would timeout a synchronous endpoint.
SageMaker Serverless Inference automatically scales to zero when there are no requests (no idle cost) and scales up on demand. It has cold start latency. Use it for intermittent/unpredictable traffic. Do NOT recommend it for latency-sensitive production applications with consistent traffic.
SageMaker Ground Truth is the human-in-the-loop data labeling service. It uses active learning to automatically label easy examples and routes ambiguous examples to human labelers (via Mechanical Turk, private workforce, or vendor workforce). If a question asks about labeling training data at scale with cost optimization, Ground Truth is the answer.
Amazon Kinesis Data Firehose can invoke a Lambda function to transform streaming records before delivering to S3/Redshift/OpenSearch. This is a lightweight feature engineering step in a streaming ML pipeline. Kinesis Data Streams → Lambda → SageMaker Endpoint is the real-time inference pattern.
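A Firehose transformation Lambda follows a fixed envelope: each incoming record has a `recordId` and base64-encoded `data`, and each returned record must echo the `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The derived-feature logic and field names below are invented for illustration:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: decode each record, add a
    derived feature, re-encode, and mark the record 'Ok'."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        # Example feature engineering step (hypothetical fields)
        payload["amount_usd"] = round(payload["amount_cents"] / 100, 2)
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": out}

# Local smoke test with a synthetic Firehose event -- no AWS needed
event = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(json.dumps({"amount_cents": 1999}).encode()).decode(),
}]}
result = lambda_handler(event, None)
decoded = json.loads(base64.b64decode(result["records"][0]["data"]))
print(decoded)  # {'amount_cents': 1999, 'amount_usd': 19.99}
```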
Common Mistake
SageMaker handles everything in an ML pipeline — you just use SageMaker for all steps.
Correct
SageMaker is the core ML platform (training, tuning, deployment, monitoring), but a complete ML pipeline typically integrates multiple AWS services: S3 for storage, Glue for ETL, Kinesis for streaming ingestion, Step Functions or EventBridge for orchestration, CodePipeline for CI/CD, and CloudWatch for operational monitoring. No single service covers all stages.
Exam scenarios are designed to test whether you know which service is responsible for which stage. Saying 'SageMaker does it all' will lead you to wrong answers on data ingestion, ETL, and orchestration questions.
Common Mistake
You must train a custom model in SageMaker for every ML use case on AWS.
Correct
AWS has three layers: (1) Pre-trained AI Services (Rekognition, Comprehend, Textract, etc.) — no ML knowledge required; (2) SageMaker with built-in algorithms or AutoML (SageMaker Autopilot) — some ML knowledge; (3) Custom models with custom containers — full ML expertise. The exam heavily tests choosing the RIGHT layer for the scenario's stated expertise and requirements.
Candidates with ML backgrounds instinctively reach for custom training. But if the scenario says 'no ML expertise' or 'standard NLP/vision task,' the pre-trained AI service is always the correct, cost-effective answer.
Common Mistake
SageMaker Model Monitor detects when a model's accuracy drops, so you don't need to track predictions and ground truth separately.
Correct
Model Quality monitoring DOES require ground truth labels to be merged with captured predictions. You must set up a ground truth ingestion pipeline (predictions are captured automatically, but actual labels must be provided by your application). Without ground truth, only Data Quality (input distribution drift) can be monitored automatically.
This is a common trap. Candidates assume Model Monitor is fully automatic. In reality, model quality monitoring requires operational work to supply ground truth labels, which is a real architectural consideration.
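The operational work involved can be seen in a plain-Python sketch of the join Model Quality monitoring performs: captured predictions are matched to application-supplied labels by an event identifier before any quality metric can be computed (record shapes here are simplified for illustration, not Model Monitor's actual schema):

```python
# Illustration: predictions are captured automatically, but quality
# metrics only exist once YOUR application supplies ground truth.

captured_predictions = [            # written by endpoint data capture
    {"event_id": "a1", "prediction": 1},
    {"event_id": "a2", "prediction": 0},
    {"event_id": "a3", "prediction": 1},
]
ground_truth = {"a1": 1, "a2": 1, "a3": 1}   # supplied later by the app

# Join by event id, keeping only predictions that have a label yet
labeled = [(p["prediction"], ground_truth[p["event_id"]])
           for p in captured_predictions if p["event_id"] in ground_truth]
accuracy = sum(pred == label for pred, label in labeled) / len(labeled)
print(f"accuracy on labeled window: {accuracy:.2f}")   # 2 of 3 correct
```

Without the `ground_truth` side of this join, only input-distribution (Data Quality) monitoring is possible.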
Common Mistake
AWS Glue and AWS Data Pipeline are interchangeable for ML data preparation.
Correct
AWS Data Pipeline is a legacy orchestration service for data movement. AWS Glue is the modern, serverless ETL service with a Data Catalog, built-in Spark support, and native integration with SageMaker, Athena, and Redshift. For new ML pipelines, Glue is always the correct choice. AWS Data Pipeline is essentially deprecated for new workloads.
Older study materials mention Data Pipeline. Exam scenarios describing ETL for ML should map to Glue, not Data Pipeline.
Common Mistake
SageMaker Pipelines and AWS Step Functions are the same thing — just pick either one for ML orchestration.
Correct
SageMaker Pipelines is ML-specific: it natively integrates with SageMaker training, processing, tuning, and model registry steps. It tracks lineage and experiments automatically. Step Functions is general-purpose and requires you to manually wire SageMaker API calls as tasks. SageMaker Pipelines is the right answer when the workflow is ML-centric; Step Functions when the workflow is mixed or requires complex branching logic beyond ML.
Exam questions often provide both as options. The discriminator is always the complexity of non-ML steps and the need for native ML lineage tracking.
Common Mistake
Training-serving skew is a deployment problem fixed by better testing before release.
Correct
Training-serving skew is an architectural problem caused by computing features differently during training (batch, offline) vs. serving (real-time, online). The fix is architectural — using SageMaker Feature Store to ensure the exact same feature values are used in both contexts. Testing alone cannot catch this if the feature computation code paths are different.
This misconception leads candidates to recommend CodeDeploy or blue/green deployments when the correct answer is Feature Store. Recognizing the root cause (feature inconsistency) is key.
IDEAL Pipeline Stages: Ingest → Data Prep → Engineer Features → Algorithm Train → Launch & Monitor (I-DEAL)
Three AWS ML Layers: 'APE' — AI Services (pre-trained), Platform (SageMaker AutoML/built-ins), Expert (custom containers)
SageMaker Inference Modes: 'RABS' — Real-time, Asynchronous, Batch Transform, Serverless. Match latency need to mode.
Model Monitor drift types: 'DBMF' — Data quality, Bias, Model quality, Feature attribution. 'Don't Blindly Monitor Features' alone.
When to use Feature Store: 'TRUST' — Training-serving skew, Reuse across teams, Unified feature definitions, Shared governance, Time-travel for point-in-time correct features
Choosing SageMaker Real-Time Endpoint for a batch scoring scenario (e.g., 'score 10 million records nightly'). The correct answer is SageMaker Batch Transform — it has no persistent endpoint, is dramatically cheaper for batch workloads, and is specifically designed for this use case. Candidates confuse 'needing a model' with 'needing an endpoint.'
CertAI Tutor · 2026-02-22