
Unlock meaning from text — entity extraction, sentiment, language detection, and custom ML models without writing a line of ML code
Amazon Comprehend is a fully managed Natural Language Processing (NLP) service that uses machine learning to find insights and relationships in unstructured text — including entities, key phrases, sentiment, language, syntax, and topics. It requires zero ML expertise to use pre-trained models, and supports custom entity recognition and custom classification for domain-specific needs. Comprehend is purpose-built for text analytics at scale, integrating natively with S3, Lambda, Kinesis, and other AWS services for real-time and batch pipelines.
Extract structured insights (entities, sentiment, topics, language) from unstructured text at scale using pre-trained or custom NLP models, enabling downstream analytics, automation, and responsible AI workflows
Sentiment Analysis (positive/negative/neutral/mixed)
Returns a dominant sentiment plus confidence scores for all four categories
Entity Recognition (built-in types: PERSON, LOCATION, ORGANIZATION, DATE, etc.)
Pre-trained; no labeling required
Custom Entity Recognition
Train on your own entity types (e.g., product codes, internal IDs)
Custom Document Classification
Multi-class and multi-label modes supported
Key Phrase Extraction
Identifies noun phrases that are key to document meaning
Language Detection
100+ languages; returns dominant language with confidence score
Syntax Analysis (POS tagging)
Part-of-speech tagging: nouns, verbs, adjectives, etc.
Topic Modeling (LDA)
Unsupervised; async only; you specify number of topics (1–100)
PII Detection and Redaction
Identifies and optionally redacts PII types (SSN, credit card, email, phone, etc.)
Targeted Sentiment
Sentiment tied to specific entities within a document, not just document-level
Comprehend Medical (separate service)
Specialized for clinical/medical text; extracts medical entities, ICD-10-CM, RxNorm codes
Real-time (synchronous) analysis
Low-latency, single-document; 5,000-byte limit
Asynchronous batch analysis (S3)
Large-scale; up to 1 MB per doc; results written to S3
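The two processing modes above differ mainly in per-document size limits (5,000 bytes synchronous, 1 MB asynchronous, as stated in the notes). A minimal routing sketch under those limits; the function name is illustrative, and the limit is measured in UTF-8 bytes, not characters:

```python
SYNC_LIMIT_BYTES = 5_000       # real-time (synchronous) API limit
ASYNC_LIMIT_BYTES = 1_000_000  # async batch per-document limit

def choose_mode(text: str) -> str:
    """Route a document to sync or async Comprehend processing by UTF-8 size."""
    size = len(text.encode("utf-8"))  # the limit is bytes, not characters
    if size <= SYNC_LIMIT_BYTES:
        return "sync"    # e.g. comprehend.detect_sentiment(...)
    if size <= ASYNC_LIMIT_BYTES:
        return "async"   # e.g. comprehend.start_sentiment_detection_job(...)
    raise ValueError(f"document too large for Comprehend: {size} bytes")

print(choose_mode("short customer review"))  # -> sync
```

Note the encode step: a 3,000-character document of multi-byte text can exceed the 5,000-byte synchronous limit even though its character count looks safe.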
Flywheel (active learning pipeline)
Automates model retraining with new data; reduces ongoing MLOps burden
VPC support / PrivateLink
Comprehend endpoints can be accessed privately without internet egress
KMS encryption for training data and model artifacts
Customer-managed KMS keys supported for data at rest
IAM-based access control
Fine-grained resource-level permissions for endpoints and jobs
Amazon Comprehend endpoints (real-time custom model hosting)
Deploy custom models as persistent endpoints for synchronous inference
Built-in bias detection
CRITICAL MISCONCEPTION: Comprehend does NOT have built-in algorithmic bias detection. Use SageMaker Clarify for bias detection.
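Several of the capabilities above return typed JSON. As one example, a DetectSentiment response carries the dominant sentiment plus four confidence scores; a sketch of parsing that documented shape (the numeric values below are invented, and the threshold is an illustrative choice, not an API feature):

```python
# Shape of a DetectSentiment response; the numeric values are invented.
sample_response = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {"Positive": 0.93, "Negative": 0.01,
                       "Neutral": 0.05, "Mixed": 0.01},
}

def dominant_sentiment(resp: dict, threshold: float = 0.5) -> str:
    """Return the dominant sentiment, or 'UNCERTAIN' if its score is low."""
    label = resp["Sentiment"]                      # e.g. "POSITIVE"
    score = resp["SentimentScore"][label.capitalize()]
    return label if score >= threshold else "UNCERTAIN"

print(dominant_sentiment(sample_response))  # -> POSITIVE
```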
Document Intelligence Pipeline
High frequency: Textract extracts raw text and structured data (tables, forms) from scanned PDFs/images, then Comprehend performs NLP enrichment (entities, sentiment, PII detection) on the extracted text. Common for processing invoices, contracts, medical records, and legal documents.
Event-Driven Text Analytics
High frequency: Lambda invokes Comprehend synchronous APIs in real time as text arrives (from API Gateway, SQS, Kinesis, or S3 triggers). Enables per-record NLP enrichment in serverless pipelines without managing infrastructure.
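The per-record enrichment step in this pattern can be sketched as below. The Comprehend call is injected as a plain callable so the sketch runs anywhere; inside a real Lambda handler you would pass a boto3 client call instead. The record shape and function names are assumptions for illustration:

```python
def enrich(records, detect_sentiment):
    """Attach a sentiment label to each record.

    `detect_sentiment` is any callable with the shape of the Comprehend
    DetectSentiment API: text in, {'Sentiment': ...} out. In Lambda this
    would wrap boto3, e.g.:
        client = boto3.client("comprehend")
        detect = lambda t: client.detect_sentiment(Text=t, LanguageCode="en")
    """
    out = []
    for rec in records:
        resp = detect_sentiment(rec["text"])
        out.append({**rec, "sentiment": resp["Sentiment"]})
    return out

# Stand-in for the real API call, so the sketch runs without AWS credentials.
fake = lambda text: {"Sentiment": "NEGATIVE" if "broken" in text else "NEUTRAL"}
print(enrich([{"text": "my order arrived broken"}], fake))
```

Injecting the client this way also keeps the enrichment logic unit-testable, which matters in event-driven pipelines where each record is processed independently.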
Batch NLP Analytics Data Lake
High frequency: Raw text files land in S3; async Comprehend batch jobs process them and write JSON results back to S3; Glue crawlers catalog the output; Athena queries insights. Ideal for large-scale document analytics without real-time requirements.
Real-Time Sentiment Streaming
High frequency: Social media or customer feedback streams through Kinesis; Lambda enriches each record with Comprehend sentiment scores; enriched records are indexed in OpenSearch for real-time dashboards and alerting.
Intent + Sentiment Augmented Chatbot
High frequency: Lex handles dialog management and intent classification; Comprehend adds sentiment analysis on user utterances to detect frustrated customers and trigger escalation paths or agent handoff. These are complementary, not competing services.
Custom NLP Model Development Pipeline
High frequency: SageMaker is used for advanced custom NLP model development, hyperparameter tuning, and experiment tracking, while Comprehend custom models offer a simpler managed path. For bias detection in Comprehend-generated outputs, SageMaker Clarify is the correct tool, NOT Comprehend itself.
Multimodal Content Moderation
High frequency: Rekognition handles image/video content moderation (explicit content, unsafe images); Comprehend handles text content moderation (toxic language, PII in text, sentiment). They are complementary: one for visual, one for text.
Structured NLP + Generative AI Hybrid
High frequency: Comprehend performs fast, cost-efficient structured NLP (entity extraction, PII detection, classification) as a pre-processing or post-processing step around Bedrock LLM calls. For example: detect PII with Comprehend before sending text to Bedrock to prevent data leakage.
PII Detection and Alerting Pipeline
High frequency: Comprehend PII detection flags sensitive data in incoming text; Lambda stores sanitized records in DynamoDB and triggers SNS alerts for compliance teams when PII is detected. Common compliance and data governance pattern.
Comprehend does NOT detect algorithmic bias — SageMaker Clarify is the correct service for bias detection and explainability. This is the #1 misconception in AIF-C01 exam questions. If a question asks about detecting bias in ML model predictions or training data, the answer is never Comprehend.
The 5,000-byte limit for synchronous (real-time) Comprehend APIs is a hard limit. Any architecture question involving documents larger than 5KB must use asynchronous batch processing via S3, not real-time API calls. This distinction drives architecture decisions in SAA-C03 and SAP-C02 scenarios.
Comprehend Medical is a SEPARATE service from Amazon Comprehend with its own API endpoints, pricing, limits, and entity types. It is NOT just Comprehend with a medical flag. Comprehend Medical understands clinical context (ICD-10-CM, RxNorm, medical relationships) that standard Comprehend cannot.
Content filtering (removing inappropriate content from outputs) is NOT the same as bias detection (identifying unfair model behavior toward demographic groups). Comprehend can help with content filtering via PII detection and sentiment, but it cannot detect or measure algorithmic bias. CloudWatch monitors metrics but also cannot detect bias.
Custom model endpoints in Comprehend are billed by the hour regardless of whether they receive traffic — similar to SageMaker real-time endpoints. If cost optimization is asked, consider async batch inference instead of persistent endpoints for non-real-time workloads.
Topic modeling in Comprehend is UNSUPERVISED and ASYNC only. You must specify the number of topics (1–100); Comprehend will not auto-determine the optimal number. Output is word clusters per topic — you must interpret and label topics yourself. It does not tell you 'this document is about finance.'
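Because the topic count is on you, a topic-modeling job request has to carry it explicitly. A sketch of building the kwargs for a StartTopicsDetectionJob call; the bucket URIs and role ARN are placeholders, and the validation helper is my own addition, not part of the API:

```python
def topics_job_request(input_s3, output_s3, role_arn, num_topics=10):
    """Build kwargs for comprehend.start_topics_detection_job.

    Comprehend will NOT pick the topic count for you: NumberOfTopics
    must be between 1 and 100 (the service default is 10).
    """
    if not 1 <= num_topics <= 100:
        raise ValueError("NumberOfTopics must be between 1 and 100")
    return {
        "InputDataConfig": {"S3Uri": input_s3,
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3},
        "DataAccessRoleArn": role_arn,  # role Comprehend assumes for S3 access
        "NumberOfTopics": num_topics,
    }

# Placeholder bucket names and account ID, for illustration only.
req = topics_job_request("s3://my-bucket/in/", "s3://my-bucket/out/",
                         "arn:aws:iam::123456789012:role/comprehend-role",
                         num_topics=25)
print(req["NumberOfTopics"])  # -> 25
```

The output the job writes back to S3 is word clusters per topic; labeling those clusters ("this one is finance") remains a human step.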
PII detection and PII redaction are different Comprehend features. Detection identifies and returns PII entity locations and types. Redaction returns the original text with PII replaced by placeholders. Exam scenarios about GDPR/CCPA compliance pipelines often test whether you know Comprehend can REDACT, not just detect.
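The detection-to-redaction step can be sketched as follows. The entity dicts mirror the DetectPiiEntities response shape (Type, BeginOffset, EndOffset); the sample text and spans are invented, and offsets are treated here as plain Python string indices, which holds for ASCII text:

```python
def redact(text: str, entities: list[dict], mask: str = "[{type}]") -> str:
    """Replace detected PII spans with a placeholder like [EMAIL].

    Spans are applied right-to-left so that earlier offsets stay valid
    as the string is rewritten.
    """
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (text[:ent["BeginOffset"]]
                + mask.format(type=ent["Type"])
                + text[ent["EndOffset"]:])
    return text

text = "Contact jane@example.com or 555-0100."
entities = [  # invented spans, in the API's response shape
    {"Type": "EMAIL", "BeginOffset": 8, "EndOffset": 24},
    {"Type": "PHONE", "BeginOffset": 28, "EndOffset": 36},
]
print(redact(text, entities))  # -> Contact [EMAIL] or [PHONE].
```

This is the "redaction" half the exam scenarios ask about: the sanitized string, not just the entity list, is what lands in the compliance pipeline.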
Targeted Sentiment is different from document-level sentiment. Document-level sentiment gives one sentiment score for the whole document. Targeted sentiment ties sentiment to specific entities (e.g., 'the battery [negative] but the screen [positive]'). Use targeted sentiment for product review analysis requiring attribute-level insights.
For the AIF-C01 exam: Comprehend is classified as a pre-trained AI service (like Rekognition, Transcribe, Translate). It represents the 'AI services' layer — no ML expertise required, no model training necessary for built-in features. Custom models exist but are optional extensions, not the primary value proposition.
When integrating Comprehend with Textract, the correct order is always: Textract FIRST (extracts text from images/PDFs) → Comprehend SECOND (analyzes the extracted text). Comprehend cannot process images directly — it only accepts UTF-8 encoded text input.
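The handoff point in that Textract-first pipeline is the plain-text join. A sketch that collects LINE blocks from a Textract-style response into UTF-8 text ready for Comprehend; the sample response below is hand-written with the documented Blocks/BlockType/Text shape:

```python
def textract_lines(textract_response: dict) -> str:
    """Collect LINE blocks from a Textract response into plain text.

    Comprehend only accepts UTF-8 text, so this join is the boundary
    between the two services in a Textract -> Comprehend pipeline.
    """
    return "\n".join(b["Text"]
                     for b in textract_response.get("Blocks", [])
                     if b["BlockType"] == "LINE")

sample = {"Blocks": [            # invented stand-in for a real response
    {"BlockType": "PAGE"},       # PAGE blocks carry no Text field
    {"BlockType": "LINE", "Text": "Invoice #42"},
    {"BlockType": "LINE", "Text": "Total due: $100.00"},
]}
text = textract_lines(sample)
print(text)
# `text` is now valid input for e.g. comprehend.detect_entities(Text=text, ...)
```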
Common Mistake
Amazon Comprehend can detect algorithmic bias in ML models or training datasets
Correct
Comprehend has NO bias detection capability. Amazon SageMaker Clarify is the purpose-built AWS service for detecting bias in training data and model predictions, and for generating explainability reports. Comprehend is purely for NLP tasks on text content.
This is the #1 trap in AIF-C01 exam questions on Responsible AI. The confusion arises because Comprehend 'analyzes' content, leading candidates to assume it can 'analyze' model fairness. Remember: Comprehend analyzes TEXT MEANING; Clarify analyzes MODEL BEHAVIOR. If you see 'bias detection' in an answer choice involving Comprehend, it is almost certainly wrong.
Common Mistake
CloudWatch can detect algorithmic bias in AI/ML models
Correct
CloudWatch is an observability service for metrics, logs, and alarms. It can monitor model performance metrics (accuracy, latency, error rates) but has absolutely no capability to detect whether a model is biased against demographic groups. Bias detection requires statistical analysis of model predictions across subgroups — that is SageMaker Clarify's job.
This misconception appears because CloudWatch monitors 'model performance' and candidates conflate performance monitoring with fairness monitoring. Remember: CloudWatch = operational metrics; Clarify = fairness/explainability. These are completely different dimensions of model evaluation.
Common Mistake
Post-processing filters on Comprehend output can solve underlying model bias
Correct
Post-processing filters can mask or suppress biased outputs but do not fix the root cause of bias in the model. True bias mitigation requires addressing bias in training data (pre-processing), modifying the model training objective (in-processing), or using calibrated correction techniques — all evaluated with SageMaker Clarify, not Comprehend.
This is a responsible AI trap testing whether candidates understand that content filtering ≠ bias remediation. A filter that blocks certain outputs is a band-aid, not a solution. Exam questions may present 'add a post-processing filter' as a way to 'ensure fairness' — this is incorrect. Fairness requires measurement and root-cause treatment, not output suppression.
Common Mistake
Having a large volume of training data guarantees that a Comprehend custom model will be fair and unbiased
Correct
Data volume does not guarantee fairness. If training data is historically biased (e.g., underrepresents certain demographic groups, contains societal stereotypes), more data amplifies that bias rather than correcting it. Data quality, diversity, and representativeness matter more than volume for fairness. SageMaker Clarify can detect bias in training data regardless of its size.
This misconception reflects a fundamental misunderstanding of ML fairness. Candidates assume 'more data = better model = fairer model' but this conflates accuracy with fairness. A model can be highly accurate on average while being systematically unfair to subgroups. Exam questions test whether you know that fairness requires intentional measurement and intervention, not just data accumulation.
Common Mistake
Comprehend and Amazon Lex are competing services that do the same thing
Correct
They are complementary and serve different purposes. Lex manages conversational dialog flows and intent classification for chatbots. Comprehend performs NLP analysis (sentiment, entities, key phrases) on text. A best-practice architecture uses Lex for conversation management and Comprehend to enrich understanding — for example, detecting customer frustration via sentiment to trigger escalation.
Both services process text, leading candidates to think they overlap. The key distinction: Lex is CONVERSATIONAL AI (dialog management, slot filling, intent matching); Comprehend is TEXT ANALYTICS (extract insights from any text). They work together, not instead of each other.
Common Mistake
Amazon Comprehend can process images and PDFs directly
Correct
Comprehend ONLY accepts UTF-8 encoded plain text. It cannot process images, scanned PDFs, Word documents, or any binary format. To analyze text in images or PDFs, you must first use Amazon Textract to extract the text, then pass the extracted text to Comprehend.
This causes architecture failures when candidates design pipelines that send PDFs directly to Comprehend. The correct pattern is always Textract → Comprehend for document-based workflows. Remember: Comprehend = text in, insights out. If your input is not already plain text, you need a text extraction step first.
Common Mistake
Content filtering (blocking inappropriate words or outputs) is the same as AI bias detection
Correct
Content filtering removes or blocks specific types of content (profanity, PII, harmful language) based on rules or classifiers. Bias detection measures whether a model treats different demographic groups inequitably in its predictions. These are entirely separate concerns. Comprehend can assist with content filtering (PII redaction, sentiment-based filtering); SageMaker Clarify handles bias detection.
This confusion is explicitly tested in AIF-C01 Responsible AI domain questions. The exam presents scenarios where a team wants to 'ensure their AI is fair' and offers content filtering as a solution — this is wrong. Fairness requires statistical analysis of model outputs across demographic subgroups, not filtering of specific content types.
COMPREHEND = 'CEKPLS' → Classification (custom), Entities, Key phrases, PII, Language detection, Sentiment — the 6 core NLP superpowers
BIAS DETECTION rule: 'Comprehend READS text, Clarify JUDGES fairness' — never confuse reading with judging
TEXTRACT → COMPREHEND order: 'Extract THEN Examine' — you must get the text out before you can understand it
5KB SYNC / 1MB ASYNC: 'Small docs = Sync, Big docs = Batch' — if it's bigger than a tweet thread, go async
Comprehend Medical ≠ Comprehend: 'Medical needs its own Medicine' — separate endpoint, separate pricing, separate entity types
Topic Modeling is UNSUPERVISED: 'Topics are a Mystery Box — YOU pick the number, Comprehend fills the box, YOU read the label'
CloudWatch ≠ Bias Detector: 'CloudWatch watches METRICS, not MORALITY' — operational monitoring cannot measure fairness
CertAI Tutor · AIF-C01, SAA-C03, SAP-C02, CLF-C02 · 2026-02-22