
Automatically extract text, forms, tables, and structured data from virtually any document — no ML expertise required
Amazon Textract is a fully managed machine learning service that automatically extracts text, handwriting, tables, forms, and structured data from scanned documents and images — going far beyond simple OCR. It understands document context and relationships, enabling extraction of key-value pairs from forms and cell data from tables without custom code or ML training. Textract is purpose-built for document processing workflows in industries like healthcare, finance, legal, and insurance.
Convert unstructured or semi-structured documents (PDFs, images, forms, tables) into machine-readable, structured data that downstream services and applications can act upon — eliminating manual data entry and enabling intelligent document processing pipelines.
Use When
Avoid When
DetectDocumentText (OCR)
Basic text and word detection — synchronous, single-page, returns raw text blocks
AnalyzeDocument (Forms & Tables)
Extracts key-value pairs from forms and structured data from tables; supports FORMS, TABLES, QUERIES, and SIGNATURES feature types
AnalyzeID (Identity Documents)
Specialized extraction for passports and driver's licenses — parses standardized ID fields automatically
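As a sketch of what "standardized ID fields" means in practice: AnalyzeID responses carry an `IdentityDocumentFields` list whose entries pair a field `Type` (e.g. FIRST_NAME, DOCUMENT_NUMBER) with a `ValueDetection`. The helper below flattens that shape; the sample response is trimmed and illustrative, not a full API payload.

```python
def parse_identity_fields(response):
    """Flatten an AnalyzeID response into {field_name: (value, confidence)}.

    AnalyzeID returns standardized field types (FIRST_NAME, LAST_NAME,
    DOCUMENT_NUMBER, EXPIRATION_DATE, ...) regardless of the ID's layout.
    """
    fields = {}
    for doc in response.get("IdentityDocuments", []):
        for f in doc.get("IdentityDocumentFields", []):
            name = f["Type"]["Text"]
            value = f["ValueDetection"]["Text"]
            confidence = f["ValueDetection"].get("Confidence", 0.0)
            fields[name] = (value, confidence)
    return fields

# Trimmed sample of the response shape (illustrative values):
sample = {
    "IdentityDocuments": [{
        "IdentityDocumentFields": [
            {"Type": {"Text": "FIRST_NAME"},
             "ValueDetection": {"Text": "JANE", "Confidence": 98.7}},
            {"Type": {"Text": "DOCUMENT_NUMBER"},
             "ValueDetection": {"Text": "D1234567", "Confidence": 96.2}},
        ]
    }]
}
```

The same parser works for passports and driver's licenses because the field names are normalized by the service, not derived from the document layout.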
AnalyzeExpense (Receipts & Invoices)
Extracts line items, totals, vendor names, and dates from expense documents — purpose-built for financial document processing
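AnalyzeExpense returns `ExpenseDocuments`, each with typed `SummaryFields` (VENDOR_NAME, TOTAL, INVOICE_RECEIPT_DATE, ...). A minimal sketch of pulling those out, using a trimmed sample response rather than a live API call:

```python
def summarize_expense(response):
    """Collect the typed summary fields from an AnalyzeExpense response.

    Unlike generic TABLES extraction, these fields carry expense
    semantics (vendor, total, tax) without custom configuration.
    """
    summary = {}
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type and value:
                summary[field_type] = value
    return summary

# Trimmed sample response (illustrative values):
sample = {"ExpenseDocuments": [{"SummaryFields": [
    {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Acme Corp"}},
    {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$42.10"}},
]}]}
```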
Queries Feature (Natural Language)
Ask specific questions about document content (e.g., 'What is the total amount?') — reduces noise by returning only relevant data
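A sketch of how Queries are wired up: the request adds a `QueriesConfig` alongside the `QUERIES` feature type, and the response links each QUERY block to its QUERY_RESULT block via an ANSWER relationship. The bucket/key names here are placeholders; with boto3 you would pass the dict as `client.analyze_document(**params)`.

```python
def build_query_request(bucket, key, questions):
    """Request parameters for AnalyzeDocument with the QUERIES feature."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }

def query_answers(response):
    """Map each query's text to the text of its QUERY_RESULT block."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for b in response["Blocks"]:
        if b["BlockType"] != "QUERY":
            continue
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    answers[b["Query"]["Text"]] = blocks[rid]["Text"]
    return answers
```

This is why Queries "reduces noise": instead of walking every block to find a value, you get a targeted QUERY_RESULT per question.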
Signature Detection
Detects presence of signatures on documents — useful for compliance and legal workflows
Asynchronous Processing
StartDocumentTextDetection and StartDocumentAnalysis for multi-page, large documents via S3
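A minimal sketch of the async request shape: the document must be referenced in S3 (not passed as bytes), and the optional `NotificationChannel` names the SNS topic plus the IAM role Textract assumes to publish there. The ARNs and bucket below are placeholders; with boto3 this dict is passed to `client.start_document_analysis(**params)`.

```python
def start_analysis_params(bucket, key, sns_topic_arn, role_arn):
    """Parameters for StartDocumentAnalysis, the async API required
    for multi-page documents stored in S3."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS", "TABLES"],
        "NotificationChannel": {
            # Textract publishes the job-completion event to this topic
            "SNSTopicArn": sns_topic_arn,
            # Role Textract assumes in order to publish to SNS
            "RoleArn": role_arn,
        },
    }
```

The call returns a JobId immediately; results are fetched later with GetDocumentAnalysis once the SNS notification arrives.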
Human Review Integration (A2I)
Amazon Augmented AI (A2I) can route documents to human review when Textract confidence scores fall below a configured threshold

Handwriting Detection
Detects handwritten text in addition to printed text — accuracy varies with handwriting quality
Bounding Box Coordinates
Every extracted element includes geometric position data (bounding box) for document reconstruction or UI highlighting
Confidence Scores
Each extracted element includes a confidence score (0-100) enabling conditional routing for human review
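The conditional-routing idea can be sketched as a simple partition over extracted blocks; the threshold value below is an illustrative assumption, tuned per workload, and the low-confidence partition is what you would hand to an A2I human loop.

```python
REVIEW_THRESHOLD = 90.0  # illustrative; tune per workload and compliance needs

def split_by_confidence(blocks, threshold=REVIEW_THRESHOLD):
    """Partition extracted blocks into auto-accepted vs. needs-human-review.

    This is the decision point that would trigger an A2I human loop
    for the low-confidence partition.
    """
    accepted, review = [], []
    for block in blocks:
        conf = block.get("Confidence")
        if conf is None:  # some blocks (e.g. PAGE) carry no confidence score
            continue
        (accepted if conf >= threshold else review).append(block)
    return accepted, review
```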
Custom Queries / Lending AI
Amazon Textract Lending AI provides specialized extraction for mortgage and lending documents
Custom models / fine-tuning
Textract is a pre-trained managed service — you cannot fine-tune it. Use Amazon SageMaker for custom document ML models.
Real-time streaming input
Textract processes static documents — not live video or streaming data feeds
Document Intelligence Pipeline
(high freq) Textract extracts raw text and structured data from documents; Comprehend then performs NLP tasks (entity recognition, sentiment analysis, classification, PII detection) on the extracted text. This is the canonical pattern for intelligent document processing — Textract handles 'what does the document say' and Comprehend handles 'what does it mean'.
Async Document Processing Pipeline
(high freq) Documents uploaded to S3 trigger a Lambda function that calls StartDocumentAnalysis. Textract publishes completion notifications to SNS/SQS. A consumer Lambda retrieves results via GetDocumentAnalysis. This event-driven pattern handles large volumes of multi-page documents without blocking.
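One detail of the consumer side worth sketching: GetDocumentAnalysis paginates large results via NextToken, so the consumer must loop until the token is absent. The `fetch_page` callable below stands in for `textract.get_document_analysis` so the sketch stays runnable without AWS credentials.

```python
def collect_all_blocks(fetch_page, job_id):
    """Drain a paginated async Textract result set.

    `fetch_page(JobId=..., NextToken=...)` stands in for
    textract.get_document_analysis; large documents return their
    blocks across multiple pages linked by NextToken.
    """
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = fetch_page(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

Forgetting this loop is a common bug: a 200-page contract silently loses everything past the first result page.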
Human-in-the-Loop Review
(high freq) When Textract returns low confidence scores on extracted fields, A2I routes those documents to human reviewers via a workforce (private, Mechanical Turk, or vendor). Reviewers correct extractions, and results are fed back. Essential for regulated industries requiring accuracy guarantees.
Generative AI Document Q&A
(high freq) Textract extracts structured text from documents; extracted content is passed as context to a foundation model in Amazon Bedrock for summarization, question-answering, or document comparison. Enables RAG (Retrieval Augmented Generation) patterns on document archives.
Async Completion Notification
(high freq) Textract async jobs publish job completion events to Amazon SNS. Lambda functions subscribe to SNS to retrieve and process results. This decouples document submission from result retrieval, enabling scalable, resilient pipelines.
Document + Image Analysis
(medium freq) Textract handles text/form extraction from documents while Rekognition handles embedded images, faces, or objects within those documents. Used in ID verification workflows: Rekognition compares the photo on an ID card while Textract extracts the text fields.
Multilingual Document Processing
(medium freq) Textract extracts text from documents in the source language; Amazon Translate converts the extracted text to the target language. Enables processing of foreign-language documents (contracts, medical records) in a unified pipeline.
Document Data Lake Ingestion
(medium freq) Textract extracts structured data from documents (invoices, forms) and stores results as JSON in S3. AWS Glue crawlers catalog the data; Athena enables SQL queries across thousands of extracted documents for analytics and reporting.
Custom Post-Processing ML
(medium freq) Textract provides raw extraction; SageMaker models perform custom classification, validation, or entity linking on the extracted fields. Used when Textract's built-in features are insufficient and domain-specific ML is needed on the extracted output.
Intelligent Document Search
(medium freq) Textract extracts text from scanned documents and PDFs; Amazon Kendra indexes the extracted content for intelligent enterprise search. Enables semantic search across document archives that were previously unsearchable (scanned images, handwritten notes).
Synchronous APIs (DetectDocumentText, AnalyzeDocument) only support SINGLE-PAGE documents up to 5 MB. Multi-page PDFs ALWAYS require asynchronous APIs (StartDocumentTextDetection, StartDocumentAnalysis) with documents stored in S3. This is the most tested Textract architectural decision.
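For the single-page sync path, the response is a flat list of blocks (PAGE, LINE, WORD), and reassembling plain text means keeping only LINE blocks. A minimal sketch, using a trimmed sample response; with boto3 the live call would be `textract.detect_document_text(Document={"Bytes": image_bytes})`.

```python
def extract_lines(response):
    """Join the LINE blocks of a DetectDocumentText response into plain text.

    WORD blocks duplicate the same content at finer granularity, so
    filtering on LINE avoids emitting every word twice.
    """
    return "\n".join(
        b["Text"] for b in response.get("Blocks", [])
        if b["BlockType"] == "LINE"
    )

# Trimmed sample of the sync response shape:
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #1001"},
    {"BlockType": "WORD", "Text": "Invoice"},
    {"BlockType": "WORD", "Text": "#1001"},
    {"BlockType": "LINE", "Text": "Total: $42.10"},
]}
```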
Textract is EXTRACTION only — it does not understand, classify, or derive meaning from text. For NLP tasks on extracted text (sentiment, entity recognition, PII detection, classification), you MUST pair Textract with Amazon Comprehend. These two services are complementary, not interchangeable.
When exam questions describe a need for human review of low-confidence document extractions in regulated industries (healthcare, finance, legal), the answer is Amazon Textract + Amazon Augmented AI (A2I). A2I is the purpose-built service for human-in-the-loop ML workflows.
SYNC = 1 page, 5 MB max. ASYNC = multi-page (up to 3,000), 500 MB, requires S3. Any multi-page document question → async API is the ONLY correct answer.
Textract EXTRACTS; Comprehend ANALYZES. They are complementary, never interchangeable. Textract cannot understand meaning; Comprehend cannot read images. Intelligent document processing requires BOTH.
For low-confidence extractions requiring human review → Amazon Augmented AI (A2I). For identity documents → AnalyzeID. For invoices/receipts → AnalyzeExpense. Always use the most specific API or integration for the task.
The AnalyzeDocument API supports four distinct feature types: FORMS (key-value pairs), TABLES (structured grid data), QUERIES (natural language questions), and SIGNATURES. Each feature type is billed separately and must be explicitly specified in the request. Don't confuse DetectDocumentText (raw OCR only) with AnalyzeDocument (structured extraction).
For expense and receipt processing, use AnalyzeExpense — not AnalyzeDocument with TABLES. AnalyzeExpense is purpose-built for invoices and receipts and understands vendor names, line items, totals, and tax fields without custom configuration.
Textract async jobs require an IAM role that Textract can assume to publish to SNS. The completion notification goes to SNS → SQS (or Lambda). Always remember the SNS → SQS fan-out pattern for reliable async result retrieval at scale.
For identity document processing (passports, driver's licenses), use AnalyzeID — not generic AnalyzeDocument. AnalyzeID understands ID document structure and returns standardized field names. Pair with Rekognition for face comparison in identity verification workflows.
Textract CANNOT be fine-tuned or retrained. It is a pre-trained managed service. If a question asks about customizing the underlying model, the answer is Amazon SageMaker — not Textract. Textract's 'customization' comes through the Queries feature and post-processing logic, not model retraining.
For the AIF-C01 exam: Textract falls under the 'AI Services' category — pre-built, API-driven, no ML expertise required. Contrast with SageMaker (custom ML platform) and Bedrock (foundation models). Textract is the correct answer when the scenario involves document text extraction without custom model development.
In cost-optimization scenarios, use DetectDocumentText when you only need raw text (cheapest). Use AnalyzeDocument with only the specific feature types you need (FORMS or TABLES, not both if unnecessary). Enabling unnecessary feature types wastes money.
The Queries feature is a powerful differentiator: instead of parsing all extracted blocks to find a specific value, you ask Textract a natural language question (e.g., 'What is the policy number?') and get a targeted answer. This reduces downstream processing complexity and is increasingly tested.
Common Mistake
Amazon Comprehend can replace Textract for document data extraction — it's an AI service that understands text, so it should work on documents too.
Correct
Amazon Comprehend requires TEXT INPUT — it cannot process images, scanned PDFs, or any document format. Comprehend analyzes text that has already been extracted. Textract must first extract the text from documents, then Comprehend can analyze it. They are complementary, not substitutable.
This is the #1 Textract misconception on exams. The trap question presents a document processing scenario and offers Comprehend as a 'smarter' alternative. Remember: Comprehend = NLP on text; Textract = extraction from documents. You need BOTH for intelligent document processing.
Common Mistake
AWS Glue DataBrew can process and extract data from scanned documents and images because it's a data preparation service.
Correct
AWS Glue DataBrew is designed for STRUCTURED data (CSV, JSON, Parquet, database tables). It has no capability to process images, PDFs, or unstructured documents. Textract is the correct service for extracting data from document images. DataBrew could be used AFTER Textract has extracted and structured the data.
Exam questions exploit the broad definition of 'data preparation' to trick candidates into selecting DataBrew for document workflows. The key discriminator: if the source data is an image or scanned document, DataBrew is wrong — Textract is right.
Common Mistake
Amazon Textract + Amazon Macie provides a complete document processing and transformation workflow for sensitive documents.
Correct
Textract extracts data from documents; Macie discovers and protects sensitive data (PII, PHI) stored in S3. While both can be used together, they do NOT form a 'transformation workflow.' Macie is a security/compliance service, not a data transformation tool. For transformation after extraction, use Lambda, Glue, or Step Functions.
This misconception appears in questions about healthcare or financial document pipelines. Macie's role is to DETECT sensitive data in S3 storage — it doesn't transform or process document content. Don't confuse data security (Macie) with data extraction (Textract) or data transformation (Glue/Lambda).
Common Mistake
Textract can analyze images for objects, scenes, and faces — it's an AI vision service that handles all image analysis needs.
Correct
Textract is EXCLUSIVELY for extracting text, forms, and tables from documents. It cannot detect objects, recognize faces, identify scenes, or perform general computer vision tasks. Amazon Rekognition is the correct service for image and video analysis. Textract only 'sees' text and document structure.
Both Textract and Rekognition work with images, which causes confusion. The mental model: if the image IS a document (form, invoice, ID), use Textract. If the image CONTAINS objects/faces/scenes, use Rekognition. In ID verification, you use BOTH: Textract for text fields, Rekognition for face matching.
Common Mistake
Textract synchronous APIs can process multi-page PDFs — just pass the PDF file directly to the API.
Correct
Synchronous Textract APIs only process SINGLE-PAGE documents. Multi-page PDFs MUST use asynchronous APIs (StartDocumentTextDetection, StartDocumentAnalysis), which require the document to be stored in Amazon S3 first. Passing a multi-page PDF to a synchronous API will result in only the first page being processed or an error.
This is a critical architectural trap. Exam scenarios describe processing 50-page contracts or 200-page reports and ask which API to use. The answer is always async for multi-page. Sync = 1 page max. Async = up to 3,000 pages from S3.
Common Mistake
All AWS AI services are interchangeable for building AI agents — any of them can perform document understanding, so pick whichever is most familiar.
Correct
Each AWS AI service has a specific, non-overlapping purpose. Textract = document text/structure extraction. Comprehend = NLP on text. Rekognition = image/video analysis. Transcribe = speech-to-text. Translate = language translation. These are specialized tools, not general-purpose AI agents. Using the wrong service for a task either produces wrong results or simply doesn't work.
The AIF-C01 exam specifically tests understanding of which AI service to use for which task. The trap is selecting a 'nearby' AI service that sounds plausible. Always match the input type and desired output type to the correct service.
Common Mistake
Textract's AnalyzeDocument with TABLES feature is the best way to process expense reports and invoices.
Correct
AnalyzeExpense is purpose-built for expense documents and invoices — it understands expense-specific semantics (vendor, line items, totals, tax, tip) and returns structured expense data. AnalyzeDocument TABLES would treat an invoice as a generic table, missing semantic context. Always use the most specific API for the document type.
Exam questions test whether you know the specialized APIs (AnalyzeExpense, AnalyzeID) versus the generic ones. The specialized APIs produce better results and are the architecturally correct choice when available.
TEXTRACT = TEXT EXTRACT — it only EXTRACTS text from documents; it does NOT understand, classify, or analyze meaning (that's Comprehend's job). Think: 'Textract is a miner, Comprehend is the analyst.'
SYNC = SINGLE page (S=S). ASYNC = Any number of pages (up to 3,000). When you see 'multi-page' in an exam question, your brain should immediately say 'ASYNC + S3'.
The Textract API family: Detect (raw text) → Analyze (forms/tables/queries) → AnalyzeExpense (invoices) → AnalyzeID (identity docs). Each step is more specialized. Match the API to the document TYPE.
For human review of uncertain extractions: A2I = 'Accuracy Insurance' — when Textract isn't sure, A2I calls in a human. Low confidence score → A2I → Human reviewer → Corrected output.
Textract + Comprehend = EXTRACT then ANALYZE. Textract mines the ore (raw text from documents); Comprehend refines it (finds meaning, entities, sentiment). You need both for the full pipeline.
CertAI Tutor · SAA-C03, SAP-C02, AIF-C01, CLF-C02 · 2026-02-22