
Automatically extract text, forms, tables, and structured data from virtually any document — no ML expertise required
Amazon Textract is a fully managed machine learning service that automatically extracts text, handwriting, tables, forms, and structured data from scanned documents and images — going far beyond simple OCR. It understands document context and relationships, enabling extraction of key-value pairs from forms and cell data from tables without custom code or ML training. Textract is purpose-built for document processing workflows in industries like healthcare, finance, legal, and insurance.
Convert unstructured or semi-structured documents (PDFs, images, forms, tables) into machine-readable, structured data that downstream services and applications can act upon — eliminating manual data entry and enabling intelligent document processing pipelines.
Use When
Avoid When
DetectDocumentText (OCR)
Basic text and word detection — synchronous, single-page, returns raw text blocks
AnalyzeDocument (Forms & Tables)
Extracts key-value pairs from forms and structured data from tables; supports FORMS, TABLES, QUERIES, and SIGNATURES feature types
AnalyzeID (Identity Documents)
Specialized extraction for passports and driver's licenses — parses standardized ID fields automatically
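As a sketch of what "standardized ID fields" means in practice: AnalyzeID responses carry an `IdentityDocumentFields` list whose entries pair a field `Type` (e.g. FIRST_NAME, DOCUMENT_NUMBER) with a `ValueDetection`. The helper below flattens that shape; the sample response is trimmed and illustrative, not a full API payload.

```python
def parse_identity_fields(response):
    """Flatten an AnalyzeID response into {field_name: (value, confidence)}.

    AnalyzeID returns standardized field types (FIRST_NAME, LAST_NAME,
    DOCUMENT_NUMBER, EXPIRATION_DATE, ...) regardless of the ID's layout.
    """
    fields = {}
    for doc in response.get("IdentityDocuments", []):
        for f in doc.get("IdentityDocumentFields", []):
            name = f["Type"]["Text"]
            value = f["ValueDetection"]["Text"]
            confidence = f["ValueDetection"].get("Confidence", 0.0)
            fields[name] = (value, confidence)
    return fields

# Trimmed sample of the response shape (illustrative values):
sample = {
    "IdentityDocuments": [{
        "IdentityDocumentFields": [
            {"Type": {"Text": "FIRST_NAME"},
             "ValueDetection": {"Text": "JANE", "Confidence": 98.7}},
            {"Type": {"Text": "DOCUMENT_NUMBER"},
             "ValueDetection": {"Text": "D1234567", "Confidence": 96.2}},
        ]
    }]
}
```

The same parser works for passports and driver's licenses because the field names are normalized by the service, not derived from the document layout.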
AnalyzeExpense (Receipts & Invoices)
Extracts line items, totals, vendor names, and dates from expense documents — purpose-built for financial document processing
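AnalyzeExpense returns `ExpenseDocuments`, each with typed `SummaryFields` (VENDOR_NAME, TOTAL, INVOICE_RECEIPT_DATE, ...). A minimal sketch of pulling those out, using a trimmed sample response rather than a live API call:

```python
def summarize_expense(response):
    """Collect the typed summary fields from an AnalyzeExpense response.

    Unlike generic TABLES extraction, these fields carry expense
    semantics (vendor, total, tax) without custom configuration.
    """
    summary = {}
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type and value:
                summary[field_type] = value
    return summary

# Trimmed sample response (illustrative values):
sample = {"ExpenseDocuments": [{"SummaryFields": [
    {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Acme Corp"}},
    {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$42.10"}},
]}]}
```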
Queries Feature (Natural Language)
Ask specific questions about document content (e.g., 'What is the total amount?') — reduces noise by returning only relevant data
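A sketch of how Queries are wired up: the request adds a `QueriesConfig` alongside the `QUERIES` feature type, and the response links each QUERY block to its QUERY_RESULT block via an ANSWER relationship. The bucket/key names here are placeholders; with boto3 you would pass the dict as `client.analyze_document(**params)`.

```python
def build_query_request(bucket, key, questions):
    """Request parameters for AnalyzeDocument with the QUERIES feature."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }

def query_answers(response):
    """Map each query's text to the text of its QUERY_RESULT block."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for b in response["Blocks"]:
        if b["BlockType"] != "QUERY":
            continue
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    answers[b["Query"]["Text"]] = blocks[rid]["Text"]
    return answers
```

This is why Queries "reduces noise": instead of walking every block to find a value, you get a targeted QUERY_RESULT per question.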
Signature Detection
Detects presence of signatures on documents — useful for compliance and legal workflows
Asynchronous Processing
StartDocumentTextDetection and StartDocumentAnalysis for multi-page, large documents via S3
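A minimal sketch of the async request shape: the document must be referenced in S3 (not passed as bytes), and the optional `NotificationChannel` names the SNS topic plus the IAM role Textract assumes to publish there. The ARNs and bucket below are placeholders; with boto3 this dict is passed to `client.start_document_analysis(**params)`.

```python
def start_analysis_params(bucket, key, sns_topic_arn, role_arn):
    """Parameters for StartDocumentAnalysis, the async API required
    for multi-page documents stored in S3."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS", "TABLES"],
        "NotificationChannel": {
            # Textract publishes the job-completion event to this topic
            "SNSTopicArn": sns_topic_arn,
            # Role Textract assumes in order to publish to SNS
            "RoleArn": role_arn,
        },
    }
```

The call returns a JobId immediately; results are fetched later with GetDocumentAnalysis once the SNS notification arrives.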
Human Review Integration (A2I)
Amazon Augmented AI (A2I) can route documents to human review when Textract confidence scores fall below a configured threshold

Handwriting Detection
Detects handwritten text in addition to printed text — accuracy varies with handwriting quality
Bounding Box Coordinates
Every extracted element includes geometric position data (bounding box) for document reconstruction or UI highlighting
Confidence Scores
Each extracted element includes a confidence score (0-100) enabling conditional routing for human review
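The conditional-routing idea can be sketched as a simple partition over extracted blocks; the threshold value below is an illustrative assumption, tuned per workload, and the low-confidence partition is what you would hand to an A2I human loop.

```python
REVIEW_THRESHOLD = 90.0  # illustrative; tune per workload and compliance needs

def split_by_confidence(blocks, threshold=REVIEW_THRESHOLD):
    """Partition extracted blocks into auto-accepted vs. needs-human-review.

    This is the decision point that would trigger an A2I human loop
    for the low-confidence partition.
    """
    accepted, review = [], []
    for block in blocks:
        conf = block.get("Confidence")
        if conf is None:  # some blocks (e.g. PAGE) carry no confidence score
            continue
        (accepted if conf >= threshold else review).append(block)
    return accepted, review
```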
Custom Queries / Lending AI
Amazon Textract Lending AI provides specialized extraction for mortgage and lending documents
Custom models / fine-tuning
Textract is a pre-trained managed service — you cannot fine-tune it. Use Amazon SageMaker for custom document ML models.
Real-time streaming input
Textract processes static documents — not live video or streaming data feeds
Document Intelligence Pipeline
(high freq) Textract extracts raw text and structured data from documents; Comprehend then performs NLP tasks (entity recognition, sentiment analysis, classification, PII detection) on the extracted text. This is the canonical pattern for intelligent document processing — Textract handles 'what does the document say' and Comprehend handles 'what does it mean'.
Async Document Processing Pipeline
(high freq) Documents uploaded to S3 trigger a Lambda function that calls StartDocumentAnalysis. Textract publishes completion notifications to SNS/SQS. A consumer Lambda retrieves results via GetDocumentAnalysis. This event-driven pattern handles large volumes of multi-page documents without blocking.
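One detail of the consumer side worth sketching: GetDocumentAnalysis paginates large results via NextToken, so the consumer must loop until the token is absent. The `fetch_page` callable below stands in for `textract.get_document_analysis` so the sketch stays runnable without AWS credentials.

```python
def collect_all_blocks(fetch_page, job_id):
    """Drain a paginated async Textract result set.

    `fetch_page(JobId=..., NextToken=...)` stands in for
    textract.get_document_analysis; large documents return their
    blocks across multiple pages linked by NextToken.
    """
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = fetch_page(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

Forgetting this loop is a common bug: a 200-page contract silently loses everything past the first result page.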
Human-in-the-Loop Review
(high freq) When Textract returns low confidence scores on extracted fields, A2I routes those documents to human reviewers via a workforce (private, Mechanical Turk, or vendor). Reviewers correct extractions, and results are fed back. Essential for regulated industries requiring accuracy guarantees.
Generative AI Document Q&A
(high freq) Textract extracts structured text from documents; extracted content is passed as context to a foundation model in Amazon Bedrock for summarization, question-answering, or document comparison. Enables RAG (Retrieval Augmented Generation) patterns on document archives.
Async Completion Notification
(high freq) Textract async jobs publish job completion events to Amazon SNS. Lambda functions subscribe to SNS to retrieve and process results. This decouples document submission from result retrieval, enabling scalable, resilient pipelines.
Document + Image Analysis
(medium freq) Textract handles text/form extraction from documents while Rekognition handles embedded images, faces, or objects within those documents. Used in ID verification workflows: Rekognition compares the photo on an ID card while Textract extracts the text fields.
Multilingual Document Processing
(medium freq) Textract extracts text from documents in the source language; Amazon Translate converts the extracted text to the target language. Enables processing of foreign-language documents (contracts, medical records) in a unified pipeline.
Document Data Lake Ingestion
(medium freq) Textract extracts structured data from documents (invoices, forms) and stores results as JSON in S3. AWS Glue crawlers catalog the data; Athena enables SQL queries across thousands of extracted documents for analytics and reporting.
Custom Post-Processing ML
(medium freq) Textract provides raw extraction; SageMaker models perform custom classification, validation, or entity linking on the extracted fields. Used when Textract's built-in features are insufficient and domain-specific ML is needed on the extracted output.
Intelligent Document Search
(medium freq) Textract extracts text from scanned documents and PDFs; Amazon Kendra indexes the extracted content for intelligent enterprise search. Enables semantic search across document archives that were previously unsearchable (scanned images, handwritten notes).
Synchronous APIs (DetectDocumentText, AnalyzeDocument) only support SINGLE-PAGE documents up to 5 MB. Multi-page PDFs ALWAYS require asynchronous APIs (StartDocumentTextDetection, StartDocumentAnalysis) with documents stored in S3. This is the most tested Textract architectural decision.
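For the single-page sync path, the response is a flat list of blocks (PAGE, LINE, WORD), and reassembling plain text means keeping only LINE blocks. A minimal sketch, using a trimmed sample response; with boto3 the live call would be `textract.detect_document_text(Document={"Bytes": image_bytes})`.

```python
def extract_lines(response):
    """Join the LINE blocks of a DetectDocumentText response into plain text.

    WORD blocks duplicate the same content at finer granularity, so
    filtering on LINE avoids emitting every word twice.
    """
    return "\n".join(
        b["Text"] for b in response.get("Blocks", [])
        if b["BlockType"] == "LINE"
    )

# Trimmed sample of the sync response shape:
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #1001"},
    {"BlockType": "WORD", "Text": "Invoice"},
    {"BlockType": "WORD", "Text": "#1001"},
    {"BlockType": "LINE", "Text": "Total: $42.10"},
]}
```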
Textract is EXTRACTION only — it does not understand, classify, or derive meaning from text. For NLP tasks on extracted text (sentiment, entity recognition, PII detection, classification), you MUST pair Textract with Amazon Comprehend. These two services are complementary, not interchangeable.
When exam questions describe a need for human review of low-confidence document extractions in regulated industries (healthcare, finance, legal), the answer is Amazon Textract + Amazon Augmented AI (A2I). A2I is the purpose-built service for human-in-the-loop ML workflows.
SYNC = 1 page, 5 MB max. ASYNC = multi-page (up to 3,000), 500 MB, requires S3. Any multi-page document question → async API is the ONLY correct answer.
Textract EXTRACTS; Comprehend ANALYZES. They are complementary, never interchangeable. Textract cannot understand meaning; Comprehend cannot read images. Intelligent document processing requires BOTH.
For low-confidence extractions requiring human review → Amazon Augmented AI (A2I). For identity documents → AnalyzeID. For invoices/receipts → AnalyzeExpense. Always use the most specific API or integration for the task.
The AnalyzeDocument API supports four distinct feature types: FORMS (key-value pairs), TABLES (structured grid data), QUERIES (natural language questions), and SIGNATURES. Each feature type is billed separately and must be explicitly specified in the request. Don't confuse DetectDocumentText (raw OCR only) with AnalyzeDocument (structured extraction).
For expense and receipt processing, use AnalyzeExpense — not AnalyzeDocument with TABLES. AnalyzeExpense is purpose-built for invoices and receipts and understands vendor names, line items, totals, and tax fields without custom configuration.
Textract async jobs require an IAM role that Textract can assume to publish to SNS. The completion notification goes to SNS → SQS (or Lambda). Always remember the SNS → SQS fan-out pattern for reliable async result retrieval at scale.
For identity document processing (passports, driver's licenses), use AnalyzeID — not generic AnalyzeDocument. AnalyzeID understands ID document structure and returns standardized field names. Pair with Rekognition for face comparison in identity verification workflows.
Textract CANNOT be fine-tuned or retrained. It is a pre-trained managed service. If a question asks about customizing the underlying model, the answer is Amazon SageMaker — not Textract. Textract's 'customization' comes through the Queries feature and post-processing logic, not model retraining.
For the AIF-C01 exam: Textract falls under the 'AI Services' category — pre-built, API-driven, no ML expertise required. Contrast with SageMaker (custom ML platform) and Bedrock (foundation models). Textract is the correct answer when the scenario involves document text extraction without custom model development.
In cost-optimization scenarios, use DetectDocumentText when you only need raw text (cheapest). Use AnalyzeDocument with only the specific feature types you need (FORMS or TABLES, not both if unnecessary). Enabling unnecessary feature types wastes money.
The Queries feature is a powerful differentiator: instead of parsing all extracted blocks to find a specific value, you ask Textract a natural language question (e.g., 'What is the policy number?') and get a targeted answer. This reduces downstream processing complexity and is increasingly tested.
Common Mistake
Amazon Comprehend can replace Textract for document data extraction — it's an AI service that understands text, so it should work on documents too.
Correct
Amazon Comprehend requires TEXT INPUT — it cannot process images, scanned PDFs, or any document format. Comprehend analyzes text that has already been extracted. Textract must first extract the text from documents, then Comprehend can analyze it. They are complementary, not substitutable.
This is the #1 Textract misconception on exams. The trap question presents a document processing scenario and offers Comprehend as a 'smarter' alternative. Remember: Comprehend = NLP on text; Textract = extraction from documents. You need BOTH for intelligent document processing.
Common Mistake
AWS Glue DataBrew can process and extract data from scanned documents and images because it's a data preparation service.
Correct
AWS Glue DataBrew is designed for STRUCTURED data (CSV, JSON, Parquet, database tables). It has no capability to process images, PDFs, or unstructured documents. Textract is the correct service for extracting data from document images. DataBrew could be used AFTER Textract has extracted and structured the data.
Exam questions exploit the broad definition of 'data preparation' to trick candidates into selecting DataBrew for document workflows. The key discriminator: if the source data is an image or scanned document, DataBrew is wrong — Textract is right.
Common Mistake
Amazon Textract + Amazon Macie provides a complete document processing and transformation workflow for sensitive documents.
Correct
Textract extracts data from documents; Macie discovers and protects sensitive data (PII, PHI) stored in S3. While both can be used together, they do NOT form a 'transformation workflow.' Macie is a security/compliance service, not a data transformation tool. For transformation after extraction, use Lambda, Glue, or Step Functions.
This misconception appears in questions about healthcare or financial document pipelines. Macie's role is to DETECT sensitive data in S3 storage — it doesn't transform or process document content. Don't confuse data security (Macie) with data extraction (Textract) or data transformation (Glue/Lambda).
Common Mistake
Textract can analyze images for objects, scenes, and faces — it's an AI vision service that handles all image analysis needs.
Correct
Textract is EXCLUSIVELY for extracting text, forms, and tables from documents. It cannot detect objects, recognize faces, identify scenes, or perform general computer vision tasks. Amazon Rekognition is the correct service for image and video analysis. Textract only 'sees' text and document structure.
Both Textract and Rekognition work with images, which causes confusion. The mental model: if the image IS a document (form, invoice, ID), use Textract. If the image CONTAINS objects/faces/scenes, use Rekognition. In ID verification, you use BOTH: Textract for text fields, Rekognition for face matching.
Common Mistake
Textract synchronous APIs can process multi-page PDFs — just pass the PDF file directly to the API.
Correct
Synchronous Textract APIs only process SINGLE-PAGE documents. Multi-page PDFs MUST use asynchronous APIs (StartDocumentTextDetection, StartDocumentAnalysis), which require the document to be stored in Amazon S3 first. Passing a multi-page PDF to a synchronous API will result in only the first page being processed or an error.
This is a critical architectural trap. Exam scenarios describe processing 50-page contracts or 200-page reports and ask which API to use. The answer is always async for multi-page. Sync = 1 page max. Async = up to 3,000 pages from S3.
Common Mistake
All AWS AI services are interchangeable for building AI agents — any of them can perform document understanding, so pick whichever is most familiar.
Correct
Each AWS AI service has a specific, non-overlapping purpose. Textract = document text/structure extraction. Comprehend = NLP on text. Rekognition = image/video analysis. Transcribe = speech-to-text. Translate = language translation. These are specialized tools, not general-purpose AI agents. Using the wrong service for a task either produces wrong results or simply doesn't work.
The AIF-C01 exam specifically tests understanding of which AI service to use for which task. The trap is selecting a 'nearby' AI service that sounds plausible. Always match the input type and desired output type to the correct service.
Common Mistake
Textract's AnalyzeDocument with TABLES feature is the best way to process expense reports and invoices.
Correct
AnalyzeExpense is purpose-built for expense documents and invoices — it understands expense-specific semantics (vendor, line items, totals, tax, tip) and returns structured expense data. AnalyzeDocument TABLES would treat an invoice as a generic table, missing semantic context. Always use the most specific API for the document type.
Exam questions test whether you know the specialized APIs (AnalyzeExpense, AnalyzeID) versus the generic ones. The specialized APIs produce better results and are the architecturally correct choice when available.
TEXTRACT = TEXT EXTRACT — it only EXTRACTS text from documents; it does NOT understand, classify, or analyze meaning (that's Comprehend's job). Think: 'Textract is a miner, Comprehend is the analyst.'
SYNC = SINGLE page (S=S). ASYNC = Any number of pages (up to 3,000). When you see 'multi-page' in an exam question, your brain should immediately say 'ASYNC + S3'.
The Textract API family: Detect (raw text) → Analyze (forms/tables/queries) → AnalyzeExpense (invoices) → AnalyzeID (identity docs). Each step is more specialized. Match the API to the document TYPE.
For human review of uncertain extractions: A2I = 'Accuracy Insurance' — when Textract isn't sure, A2I calls in a human. Low confidence score → A2I → Human reviewer → Corrected output.
Textract + Comprehend = EXTRACT then ANALYZE. Textract mines the ore (raw text from documents); Comprehend refines it (finds meaning, entities, sentiment). You need both for the full pipeline.
CertAI Tutor · SAA-C03, SAP-C02, AIF-C01, CLF-C02 · 2026-02-22