
Unlock meaning from text — entity extraction, sentiment, language detection, and custom ML models without writing a line of ML code
Amazon Comprehend is a fully managed Natural Language Processing (NLP) service that uses machine learning to find insights and relationships in unstructured text — including entities, key phrases, sentiment, language, syntax, and topics. It requires zero ML expertise to use pre-trained models, and supports custom entity recognition and custom classification for domain-specific needs. Comprehend is purpose-built for text analytics at scale, integrating natively with S3, Lambda, Kinesis, and other AWS services for real-time and batch pipelines.
Extract structured insights (entities, sentiment, topics, language) from unstructured text at scale using pre-trained or custom NLP models, enabling downstream analytics, automation, and responsible AI workflows
Sentiment Analysis (positive/negative/neutral/mixed)
Returns a dominant sentiment plus confidence scores for all four categories
Entity Recognition (built-in types: PERSON, LOCATION, ORGANIZATION, DATE, etc.)
Pre-trained; no labeling required
Custom Entity Recognition
Train on your own entity types (e.g., product codes, internal IDs)
Custom Document Classification
Multi-class and multi-label modes supported
Key Phrase Extraction
Identifies noun phrases that are key to document meaning
Language Detection
100+ languages; returns dominant language with confidence score
Syntax Analysis (POS tagging)
Part-of-speech tagging: nouns, verbs, adjectives, etc.
Topic Modeling (LDA)
Unsupervised; async only; you specify number of topics (1–100)
PII Detection and Redaction
Identifies and optionally redacts PII types (SSN, credit card, email, phone, etc.)
Targeted Sentiment
Sentiment tied to specific entities within a document, not just document-level
Comprehend Medical (separate service)
Specialized for clinical/medical text; extracts medical entities, ICD-10-CM, RxNorm codes
Real-time (synchronous) analysis
Low-latency, single-document; 5,000-byte limit
Asynchronous batch analysis (S3)
Large-scale; up to 1 MB per doc; results written to S3
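The two processing modes above differ mainly in per-document size limits (5,000 bytes synchronous, 1 MB asynchronous, as stated in the notes). A minimal routing sketch under those limits; the function name is illustrative, and the limit is measured in UTF-8 bytes, not characters:

```python
SYNC_LIMIT_BYTES = 5_000       # real-time (synchronous) API limit
ASYNC_LIMIT_BYTES = 1_000_000  # async batch per-document limit

def choose_mode(text: str) -> str:
    """Route a document to sync or async Comprehend processing by UTF-8 size."""
    size = len(text.encode("utf-8"))  # the limit is bytes, not characters
    if size <= SYNC_LIMIT_BYTES:
        return "sync"    # e.g. comprehend.detect_sentiment(...)
    if size <= ASYNC_LIMIT_BYTES:
        return "async"   # e.g. comprehend.start_sentiment_detection_job(...)
    raise ValueError(f"document too large for Comprehend: {size} bytes")

print(choose_mode("short customer review"))  # -> sync
```

Note the encode step: a 3,000-character document of multi-byte text can exceed the 5,000-byte synchronous limit even though its character count looks safe.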
Flywheel (active learning pipeline)
Automates model retraining with new data; reduces ongoing MLOps burden
VPC support / PrivateLink
Comprehend endpoints can be accessed privately without internet egress
KMS encryption for training data and model artifacts
Customer-managed KMS keys supported for data at rest
IAM-based access control
Fine-grained resource-level permissions for endpoints and jobs
Amazon Comprehend endpoints (real-time custom model hosting)
Deploy custom models as persistent endpoints for synchronous inference
Built-in bias detection
CRITICAL MISCONCEPTION: Comprehend does NOT have built-in algorithmic bias detection. Use SageMaker Clarify for bias detection.
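Several of the capabilities above return typed JSON. As one example, a DetectSentiment response carries the dominant sentiment plus four confidence scores; a sketch of parsing that documented shape (the numeric values below are invented, and the threshold is an illustrative choice, not an API feature):

```python
# Shape of a DetectSentiment response; the numeric values are invented.
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {"Positive": 0.93, "Negative": 0.01,
                       "Neutral": 0.05, "Mixed": 0.01},
}

def dominant_sentiment(resp: dict, threshold: float = 0.5) -> str:
    """Return the dominant sentiment, or 'UNCERTAIN' if its score is low."""
    label = resp["Sentiment"]                      # e.g. "POSITIVE"
    score = resp["SentimentScore"][label.capitalize()]
    return label if score >= threshold else "UNCERTAIN"

print(dominant_sentiment(sample_response))  # -> POSITIVE
```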
Document Intelligence Pipeline
High frequency: Textract extracts raw text and structured data (tables, forms) from scanned PDFs/images, then Comprehend performs NLP enrichment (entities, sentiment, PII detection) on the extracted text. Common for processing invoices, contracts, medical records, and legal documents.
Event-Driven Text Analytics
High frequency: Lambda invokes Comprehend synchronous APIs in real time as text arrives (from API Gateway, SQS, Kinesis, or S3 triggers). Enables per-record NLP enrichment in serverless pipelines without managing infrastructure.
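The per-record enrichment step in this pattern can be sketched as below. The Comprehend call is injected as a plain callable so the sketch runs anywhere; inside a real Lambda handler you would pass a boto3 client call instead. The record shape and function names are assumptions for illustration:

```python
def enrich(records, detect_sentiment):
    """Attach a sentiment label to each record.

    `detect_sentiment` is any callable with the shape of the Comprehend
    DetectSentiment API: text in, {'Sentiment': ...} out. In Lambda this
    would wrap boto3, e.g.:
        client = boto3.client("comprehend")
        detect = lambda t: client.detect_sentiment(Text=t, LanguageCode="en")
    """
    out = []
    for rec in records:
        resp = detect_sentiment(rec["text"])
        out.append({**rec, "sentiment": resp["Sentiment"]})
    return out

# Stand-in for the real API call, so the sketch runs without AWS credentials.
fake = lambda text: {"Sentiment": "NEGATIVE" if "broken" in text else "NEUTRAL"}
print(enrich([{"text": "my order arrived broken"}], fake))
```

Injecting the client this way also keeps the enrichment logic unit-testable, which matters in event-driven pipelines where each record is processed independently.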
Batch NLP Analytics Data Lake
High frequency: Raw text files land in S3; async Comprehend batch jobs process them and write JSON results back to S3; Glue crawlers catalog the output; Athena queries insights. Ideal for large-scale document analytics without real-time requirements.
Real-Time Sentiment Streaming
High frequency: Social media or customer feedback streams through Kinesis; Lambda enriches each record with Comprehend sentiment scores; enriched records are indexed in OpenSearch for real-time dashboards and alerting.
Intent + Sentiment Augmented Chatbot
High frequency: Lex handles dialog management and intent classification; Comprehend adds sentiment analysis on user utterances to detect frustrated customers and trigger escalation paths or agent handoff. These are complementary, not competing services.
Custom NLP Model Development Pipeline
High frequency: SageMaker is used for advanced custom NLP model development, hyperparameter tuning, and experiment tracking, while Comprehend custom models offer a simpler managed path. For bias detection in Comprehend-generated outputs, SageMaker Clarify is the correct tool, NOT Comprehend itself.
Multimodal Content Moderation
High frequency: Rekognition handles image/video content moderation (explicit content, unsafe images); Comprehend handles text content moderation (toxic language, PII in text, sentiment). They are complementary: one for visual, one for text.
Structured NLP + Generative AI Hybrid
High frequency: Comprehend performs fast, cost-efficient structured NLP (entity extraction, PII detection, classification) as a pre-processing or post-processing step around Bedrock LLM calls. For example: detect PII with Comprehend before sending text to Bedrock to prevent data leakage.
PII Detection and Alerting Pipeline
High frequency: Comprehend PII detection flags sensitive data in incoming text; Lambda stores sanitized records in DynamoDB and triggers SNS alerts for compliance teams when PII is detected. Common compliance and data governance pattern.
Comprehend does NOT detect algorithmic bias — SageMaker Clarify is the correct service for bias detection and explainability. This is the #1 misconception in AIF-C01 exam questions. If a question asks about detecting bias in ML model predictions or training data, the answer is never Comprehend.
The 5,000-byte limit for synchronous (real-time) Comprehend APIs is a hard limit. Any architecture question involving documents larger than 5KB must use asynchronous batch processing via S3, not real-time API calls. This distinction drives architecture decisions in SAA-C03 and SAP-C02 scenarios.
Comprehend Medical is a SEPARATE service from Amazon Comprehend with its own API endpoints, pricing, limits, and entity types. It is NOT just Comprehend with a medical flag. Comprehend Medical understands clinical context (ICD-10-CM, RxNorm, medical relationships) that standard Comprehend cannot.
Content filtering (removing inappropriate content from outputs) is NOT the same as bias detection (identifying unfair model behavior toward demographic groups). Comprehend can help with content filtering via PII detection and sentiment, but it cannot detect or measure algorithmic bias. CloudWatch monitors metrics but also cannot detect bias.
Custom model endpoints in Comprehend are billed by the hour regardless of whether they receive traffic — similar to SageMaker real-time endpoints. If cost optimization is asked, consider async batch inference instead of persistent endpoints for non-real-time workloads.
Topic modeling in Comprehend is UNSUPERVISED and ASYNC only. You must specify the number of topics (1–100); Comprehend will not auto-determine the optimal number. Output is word clusters per topic — you must interpret and label topics yourself. It does not tell you 'this document is about finance.'
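Because the topic count is on you, a topic-modeling job request has to carry it explicitly. A sketch of building the kwargs for a StartTopicsDetectionJob call; the bucket URIs and role ARN are placeholders, and the validation helper is my own addition, not part of the API:

```python
def topics_job_request(input_s3, output_s3, role_arn, num_topics=10):
    """Build kwargs for comprehend.start_topics_detection_job.

    Comprehend will NOT pick the topic count for you: NumberOfTopics
    must be between 1 and 100 (the service default is 10).
    """
    if not 1 <= num_topics <= 100:
        raise ValueError("NumberOfTopics must be between 1 and 100")
    return {
        "InputDataConfig": {"S3Uri": input_s3,
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3},
        "DataAccessRoleArn": role_arn,  # role Comprehend assumes for S3 access
        "NumberOfTopics": num_topics,
    }

# Placeholder bucket names and account ID, for illustration only.
req = topics_job_request("s3://my-bucket/in/", "s3://my-bucket/out/",
                         "arn:aws:iam::123456789012:role/comprehend-role",
                         num_topics=25)
print(req["NumberOfTopics"])  # -> 25
```

The output the job writes back to S3 is word clusters per topic; labeling those clusters ("this one is finance") remains a human step.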
PII detection and PII redaction are different Comprehend features. Detection identifies and returns PII entity locations and types. Redaction returns the original text with PII replaced by placeholders. Exam scenarios about GDPR/CCPA compliance pipelines often test whether you know Comprehend can REDACT, not just detect.
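The detection-to-redaction step can be sketched as follows. The entity dicts mirror the DetectPiiEntities response shape (Type, BeginOffset, EndOffset); the sample text and spans are invented, and offsets are treated here as plain Python string indices, which holds for ASCII text:

```python
def redact(text: str, entities: list[dict], mask: str = "[{type}]") -> str:
    """Replace detected PII spans with a placeholder like [EMAIL].

    Spans are applied right-to-left so that earlier offsets stay valid
    as the string is rewritten.
    """
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (text[:ent["BeginOffset"]]
                + mask.format(type=ent["Type"])
                + text[ent["EndOffset"]:])
    return text

text = "Contact jane@example.com or 555-0100."
entities = [  # invented spans, in the API's response shape
    {"Type": "EMAIL", "BeginOffset": 8, "EndOffset": 24},
    {"Type": "PHONE", "BeginOffset": 28, "EndOffset": 36},
]
print(redact(text, entities))  # -> Contact [EMAIL] or [PHONE].
```

This is the "redaction" half the exam scenarios ask about: the sanitized string, not just the entity list, is what lands in the compliance pipeline.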
Targeted Sentiment is different from document-level sentiment. Document-level sentiment gives one sentiment score for the whole document. Targeted sentiment ties sentiment to specific entities (e.g., 'the battery [negative] but the screen [positive]'). Use targeted sentiment for product review analysis requiring attribute-level insights.
For the AIF-C01 exam: Comprehend is classified as a pre-trained AI service (like Rekognition, Transcribe, Translate). It represents the 'AI services' layer — no ML expertise required, no model training necessary for built-in features. Custom models exist but are optional extensions, not the primary value proposition.
When integrating Comprehend with Textract, the correct order is always: Textract FIRST (extracts text from images/PDFs) → Comprehend SECOND (analyzes the extracted text). Comprehend cannot process images directly — it only accepts UTF-8 encoded text input.
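The handoff point in that Textract-first pipeline is the plain-text join. A sketch that collects LINE blocks from a Textract-style response into UTF-8 text ready for Comprehend; the sample response below is hand-written with the documented Blocks/BlockType/Text shape:

```python
def textract_lines(textract_response: dict) -> str:
    """Collect LINE blocks from a Textract response into plain text.

    Comprehend only accepts UTF-8 text, so this join is the boundary
    between the two services in a Textract -> Comprehend pipeline.
    """
    return "\n".join(b["Text"]
                     for b in textract_response.get("Blocks", [])
                     if b["BlockType"] == "LINE")

sample = {"Blocks": [            # invented stand-in for a real response
    {"BlockType": "PAGE"},       # PAGE blocks carry no Text field
    {"BlockType": "LINE", "Text": "Invoice #42"},
    {"BlockType": "LINE", "Text": "Total due: $100.00"},
]}
text = textract_lines(sample)
print(text)
# `text` is now valid input for e.g. comprehend.detect_entities(Text=text, ...)
```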
Common Mistake
Amazon Comprehend can detect algorithmic bias in ML models or training datasets
Correct
Comprehend has NO bias detection capability. Amazon SageMaker Clarify is the purpose-built AWS service for detecting bias in training data and model predictions, and for generating explainability reports. Comprehend is purely for NLP tasks on text content.
This is the #1 trap in AIF-C01 exam questions on Responsible AI. The confusion arises because Comprehend 'analyzes' content, leading candidates to assume it can 'analyze' model fairness. Remember: Comprehend analyzes TEXT MEANING; Clarify analyzes MODEL BEHAVIOR. If you see 'bias detection' in an answer choice involving Comprehend, it is almost certainly wrong.
Common Mistake
CloudWatch can detect algorithmic bias in AI/ML models
Correct
CloudWatch is an observability service for metrics, logs, and alarms. It can monitor model performance metrics (accuracy, latency, error rates) but has absolutely no capability to detect whether a model is biased against demographic groups. Bias detection requires statistical analysis of model predictions across subgroups — that is SageMaker Clarify's job.
This misconception appears because CloudWatch monitors 'model performance' and candidates conflate performance monitoring with fairness monitoring. Remember: CloudWatch = operational metrics; Clarify = fairness/explainability. These are completely different dimensions of model evaluation.
Common Mistake
Post-processing filters on Comprehend output can solve underlying model bias
Correct
Post-processing filters can mask or suppress biased outputs but do not fix the root cause of bias in the model. True bias mitigation requires addressing bias in training data (pre-processing), modifying the model training objective (in-processing), or using calibrated correction techniques — all evaluated with SageMaker Clarify, not Comprehend.
This is a responsible AI trap testing whether candidates understand that content filtering ≠ bias remediation. A filter that blocks certain outputs is a band-aid, not a solution. Exam questions may present 'add a post-processing filter' as a way to 'ensure fairness' — this is incorrect. Fairness requires measurement and root-cause treatment, not output suppression.
Common Mistake
Having a large volume of training data guarantees that a Comprehend custom model will be fair and unbiased
Correct
Data volume does not guarantee fairness. If training data is historically biased (e.g., underrepresents certain demographic groups, contains societal stereotypes), more data amplifies that bias rather than correcting it. Data quality, diversity, and representativeness matter more than volume for fairness. SageMaker Clarify can detect bias in training data regardless of its size.
This misconception reflects a fundamental misunderstanding of ML fairness. Candidates assume 'more data = better model = fairer model' but this conflates accuracy with fairness. A model can be highly accurate on average while being systematically unfair to subgroups. Exam questions test whether you know that fairness requires intentional measurement and intervention, not just data accumulation.
Common Mistake
Comprehend and Amazon Lex are competing services that do the same thing
Correct
They are complementary and serve different purposes. Lex manages conversational dialog flows and intent classification for chatbots. Comprehend performs NLP analysis (sentiment, entities, key phrases) on text. A best-practice architecture uses Lex for conversation management and Comprehend to enrich understanding — for example, detecting customer frustration via sentiment to trigger escalation.
Both services process text, leading candidates to think they overlap. The key distinction: Lex is CONVERSATIONAL AI (dialog management, slot filling, intent matching); Comprehend is TEXT ANALYTICS (extract insights from any text). They work together, not instead of each other.
Common Mistake
Amazon Comprehend can process images and PDFs directly
Correct
Comprehend ONLY accepts UTF-8 encoded plain text. It cannot process images, scanned PDFs, Word documents, or any binary format. To analyze text in images or PDFs, you must first use Amazon Textract to extract the text, then pass the extracted text to Comprehend.
This causes architecture failures when candidates design pipelines that send PDFs directly to Comprehend. The correct pattern is always Textract → Comprehend for document-based workflows. Remember: Comprehend = text in, insights out. If your input is not already plain text, you need a text extraction step first.
Common Mistake
Content filtering (blocking inappropriate words or outputs) is the same as AI bias detection
Correct
Content filtering removes or blocks specific types of content (profanity, PII, harmful language) based on rules or classifiers. Bias detection measures whether a model treats different demographic groups inequitably in its predictions. These are entirely separate concerns. Comprehend can assist with content filtering (PII redaction, sentiment-based filtering); SageMaker Clarify handles bias detection.
This confusion is explicitly tested in AIF-C01 Responsible AI domain questions. The exam presents scenarios where a team wants to 'ensure their AI is fair' and offers content filtering as a solution — this is wrong. Fairness requires statistical analysis of model outputs across demographic subgroups, not filtering of specific content types.
COMPREHEND = 'CEKPLS' → Classification (custom), Entities, Key phrases, PII, Language detection, Sentiment — the 6 core NLP superpowers
BIAS DETECTION rule: 'Comprehend READS text, Clarify JUDGES fairness' — never confuse reading with judging
TEXTRACT → COMPREHEND order: 'Extract THEN Examine' — you must get the text out before you can understand it
5KB SYNC / 1MB ASYNC: 'Small docs = Sync, Big docs = Batch' — if it's bigger than a tweet thread, go async
Comprehend Medical ≠ Comprehend: 'Medical needs its own Medicine' — separate endpoint, separate pricing, separate entity types
Topic Modeling is UNSUPERVISED: 'Topics are a Mystery Box — YOU pick the number, Comprehend fills the box, YOU read the label'
CloudWatch ≠ Bias Detector: 'CloudWatch watches METRICS, not MORALITY' — operational monitoring cannot measure fairness
CertAI Tutor · AIF-C01, SAA-C03, SAP-C02, CLF-C02 · 2026-02-22