
Fully managed computer vision AI that finds faces, objects, text, and unsafe content in images and video — no ML expertise required.
Amazon Rekognition is a fully managed deep learning-based computer vision service that analyzes images and videos to detect objects, scenes, faces, text, celebrities, and inappropriate content. It offers two core APIs — Rekognition Image for synchronous analysis and Rekognition Video for asynchronous, streaming, or stored video analysis — with no infrastructure to manage. Rekognition Custom Labels extends the service by allowing you to train custom models on your own labeled images using AutoML, without writing a single line of ML code.
Automate visual content moderation, identity verification, searchable media archives, and workplace safety monitoring at scale without building or managing ML models.
Use When
Avoid When
Object and Scene Detection (DetectLabels)
Returns labels with confidence scores and bounding boxes for 3,000+ object and scene categories
Face Detection (DetectFaces)
Detects face attributes: age range, emotions, gender, glasses, smile, pose, quality, landmarks
Face Comparison (CompareFaces)
Compares two face images and returns similarity score; used for 1:1 identity verification
Face Search / Collections (SearchFacesByImage)
1:N face search against an indexed collection; used for identity lookup at scale
Celebrity Recognition (RecognizeCelebrities)
Identifies thousands of celebrities with name and IMDb/Wikipedia links
Content Moderation (DetectModerationLabels)
Hierarchical taxonomy of unsafe content: nudity, violence, hate symbols, drugs, etc.
Text in Image Detection (DetectText)
Detects and reads printed and handwritten text in images (not a replacement for Textract on documents)
Custom Labels (AutoML)
Train custom object/scene classifiers on your own labeled images with no ML code
PPE Detection
Detects personal protective equipment (hard hats, face covers, hand covers) on persons in images
Asynchronous Video Analysis (Start/Get APIs)
Label detection, face detection, content moderation, celebrity recognition, text detection on stored S3 video
Streaming Video Analysis (Kinesis Video Streams)
Real-time face search against collections on live video streams
Segment Detection (StartSegmentDetection)
Detects technical cues (black frames, end credits, slates) and shot changes in stored video
Image Properties (DetectLabels with the IMAGE_PROPERTIES feature)
Returns dominant colors, image quality, and foreground/background analysis
GDPR / data privacy controls
Face collections can be deleted; no training on customer data without explicit opt-in
Audio analysis
Rekognition processes only visual content (pixels). Use Amazon Transcribe for audio/speech.
PDF / multi-page document processing
Use Amazon Textract for structured document extraction
Real-time object detection on streaming video (non-face)
Streaming Video via KVS only supports face search. Label/object detection on streaming requires custom architecture (SageMaker + KVS)
Event-driven image moderation pipeline
High frequency: An S3 PutObject event triggers a Lambda function that calls Rekognition DetectModerationLabels. If confidence exceeds your threshold, Lambda publishes to SNS to alert moderators or auto-quarantine the object. This is the canonical serverless content-moderation architecture.
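The Lambda side of this pipeline can be sketched as follows, a minimal example assuming an S3 event trigger; the SNS topic ARN and the 80% threshold are placeholder assumptions, not fixed values:

```python
import json
from urllib.parse import unquote_plus

# Hypothetical application threshold -- Rekognition returns confidence
# scores; deciding what to act on is your job.
MODERATION_THRESHOLD = 80.0

def flagged_labels(moderation_labels, threshold=MODERATION_THRESHOLD):
    """Return the moderation label names at or above the threshold."""
    return [l["Name"] for l in moderation_labels if l["Confidence"] >= threshold]

def handler(event, context):
    # boto3 is imported inside the handler so the module stays importable
    # outside a Lambda runtime; it is preinstalled in Lambda itself.
    import boto3
    rekognition = boto3.client("rekognition")
    sns = boto3.client("sns")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # S3 event keys are URL-encoded

    resp = rekognition.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=50,  # let the API pre-filter; the final decision is ours
    )
    hits = flagged_labels(resp["ModerationLabels"])
    if hits:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:moderation-alerts",  # placeholder
            Message=json.dumps({"bucket": bucket, "key": key, "labels": hits}),
        )
    return {"flagged": bool(hits)}
```

Keeping the threshold in application code (not hard-coded into the API call) is what enables the human-in-the-loop review band described later: e.g., auto-quarantine above 95%, route 50-95% to reviewers.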
Real-time face recognition on live video
High frequency: A camera streams video to Kinesis Video Streams. The Rekognition Streaming Video processor performs face search against a pre-indexed collection in near real time. Matches are written to DynamoDB for access logging or alerting. Used in security/surveillance and employee-timekeeping scenarios.
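The setup step can be sketched in Python: the request shape below follows the CreateStreamProcessor API, while every ARN, name, and the 85% match threshold are placeholder assumptions:

```python
def stream_processor_request(name, kvs_arn, kds_arn, role_arn,
                             collection_id, threshold=85.0):
    """Build a CreateStreamProcessor request for face search on a live
    Kinesis Video Stream. All ARNs and names here are placeholders."""
    return {
        "Name": name,
        "Input": {"KinesisVideoStream": {"Arn": kvs_arn}},
        "Output": {"KinesisDataStream": {"Arn": kds_arn}},
        "RoleArn": role_arn,
        # Face search against a pre-indexed collection -- the streaming
        # use case the exams focus on.
        "Settings": {
            "FaceSearch": {
                "CollectionId": collection_id,
                "FaceMatchThreshold": threshold,
            }
        },
    }

# Usage (requires boto3, real streams, and an IAM role):
# import boto3
# rek = boto3.client("rekognition")
# rek.create_stream_processor(**stream_processor_request(
#     "lobby-cam", kvs_arn, kds_arn, role_arn, "employees"))
# rek.start_stream_processor(Name="lobby-cam")
```

Match records land on the output Kinesis Data Stream; a consumer (e.g., Lambda) would then write them to DynamoDB.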
Intelligent document + image understanding pipeline
High frequency: Rekognition detects and crops relevant image regions (e.g., ID card photo, signature area). Textract extracts structured text from the document. Comprehend analyzes extracted text for entities or PII. Together these three services cover the full spectrum of document intelligence — a common multi-service exam scenario.
Custom Labels vs. SageMaker decision boundary
High frequency: Use Rekognition Custom Labels when you need a custom computer vision model with minimal ML expertise, small-to-medium datasets, and AutoML convenience. Use SageMaker when you need full control over model architecture, hyperparameter tuning, custom training loops, or very large datasets. The exam tests this decision boundary frequently.
Multimodal media analysis pipeline
High frequency: Rekognition analyzes video frames for visual content (labels, faces, text). Transcribe converts the audio track to text. Translate localizes the transcript. Comprehend performs sentiment/entity analysis on the text. Each service handles its modality — the exam tests knowing which service owns which data type.
Vision + generative AI augmentation
Medium frequency: Rekognition provides structured metadata (labels, bounding boxes, confidence scores) from images. Bedrock multimodal models (e.g., Claude, Titan) can then generate natural language descriptions, summaries, or Q&A responses about that visual content. Rekognition handles the structured detection; Bedrock handles open-ended generative reasoning.
Async video analysis orchestration
Medium frequency: Step Functions orchestrates a workflow: start a Rekognition video job (StartLabelDetection), wait for completion via SNS/SQS callback, retrieve results (GetLabelDetection), store structured metadata in DynamoDB, and archive processed video to S3 Glacier. This pattern handles the async nature of the Rekognition Video APIs.
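The start/poll/get cycle can be sketched in Python with an injected boto3 Rekognition client. This is a simplified polling loop for illustration; a production design would prefer the SNS completion notification (or a Step Functions wait state), and would page through results with NextToken:

```python
import time

def start_and_wait_label_detection(rekognition, bucket, key,
                                   poll_seconds=5, timeout=600):
    """Start an async label-detection job on a stored S3 video and poll
    until it finishes. Sketch only -- real workflows should react to the
    SNS notification instead of sleeping in a loop."""
    job = rekognition.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]  # async: all you get back immediately is a JobId
    waited = 0
    while waited < timeout:
        resp = rekognition.get_label_detection(JobId=job_id)
        if resp["JobStatus"] in ("SUCCEEDED", "FAILED"):
            return resp  # first page of results; follow NextToken for the rest
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```

Calling GetLabelDetection immediately after StartLabelDetection and expecting results is exactly the anti-pattern the Common Mistake section below warns about.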
Rekognition is a VISUAL service only — it processes pixels (images and video frames). It has zero capability for audio analysis, natural language processing, document layout understanding, or PII detection in text. When an exam question asks about detecting PII, use Comprehend. When it asks about document text extraction, use Textract. When it asks about audio, use Transcribe.
Rekognition Custom Labels hosted endpoints are NOT serverless — they are billed per inference unit per hour while running. You must call StartProjectVersion to start and StopProjectVersion to stop. Standard Rekognition APIs (DetectLabels, etc.) ARE serverless pay-per-call. Exam scenarios about cost optimization for custom vision models will test whether you know to stop idle Custom Labels endpoints.
For face COMPARISON (1:1 verification — 'is this person the same as this photo?'), use CompareFaces. For face SEARCH (1:N identification — 'who is this person from my database of 10,000?'), use IndexFaces + SearchFacesByImage with a Face Collection. The exam distinguishes verification vs. identification scenarios.
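The two call shapes can be sketched in Python (boto3 client injected); the 90% thresholds, collection name, and use of ExternalImageId as the person identifier are illustrative assumptions:

```python
def verify_identity(rekognition, source_bytes, target_bytes, threshold=90.0):
    """1:1 verification (CompareFaces): is the face in source the same
    person as the face in target?"""
    resp = rekognition.compare_faces(
        SourceImage={"Bytes": source_bytes},
        TargetImage={"Bytes": target_bytes},
        SimilarityThreshold=threshold,
    )
    return len(resp["FaceMatches"]) > 0

def identify_person(rekognition, collection_id, image_bytes, threshold=90.0):
    """1:N identification (SearchFacesByImage) against a collection whose
    faces were added earlier via IndexFaces."""
    resp = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=threshold,
        MaxFaces=1,
    )
    matches = resp["FaceMatches"]
    return matches[0]["Face"]["ExternalImageId"] if matches else None
```

The function shapes mirror the exam distinction: verification returns yes/no, identification returns who.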
Rekognition Video APIs are ASYNCHRONOUS. You call Start* (e.g., StartLabelDetection), receive a JobId, and then either poll GetLabelDetection or configure an SNS topic to receive a completion notification. You cannot get results synchronously for stored video. Design patterns must account for this async workflow (SNS + SQS or Step Functions polling).
Rekognition is PIXELS ONLY — no audio, no NLP, no document structure. PII in text → Comprehend. Speech → Transcribe. Form extraction → Textract. Visual content → Rekognition.
Custom Labels endpoints are billed per-hour (not per-call) and must be explicitly started/stopped. A scenario about reducing Custom Labels costs = stop idle models with a scheduled Lambda.
Video APIs are always ASYNC (Start* → SNS notification → Get*). Image APIs are always SYNC. Streaming Video (KVS) only supports face search — NOT label or object detection.
Streaming Video analysis via Kinesis Video Streams ONLY supports face search (SearchFaces). It does NOT support real-time label/object detection or content moderation on live streams. If an exam scenario needs real-time object detection on a live stream, the answer involves SageMaker or a custom solution — NOT Rekognition streaming.
Image format support is JPEG and PNG only. If a scenario involves GIF, TIFF, HEIC, WebP, or BMP images, a conversion step (e.g., Lambda using PIL/Pillow, or AWS Elemental MediaConvert for video) must be included in the architecture before calling Rekognition.
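The conversion step can be sketched with Pillow; this is a minimal example, and HEIC specifically would additionally require registering the pillow-heif plugin, which is an assumption beyond stock Pillow:

```python
import io

def to_jpeg_bytes(image_bytes, quality=90):
    """Re-encode an arbitrary raster image (PNG, GIF, BMP, WebP, TIFF...)
    as JPEG bytes that Rekognition accepts. Requires Pillow."""
    from PIL import Image
    img = Image.open(io.BytesIO(image_bytes))
    if img.mode not in ("RGB", "L"):
        img = img.convert("RGB")  # JPEG has no alpha/palette support
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

In the architectures above, this would run inside the Lambda between the S3 event and the Rekognition call.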
Content moderation results include a hierarchical taxonomy (e.g., 'Explicit Nudity' → 'Graphic Female Nudity'). The API returns confidence scores, and YOU set the confidence threshold in your application logic — Rekognition does not auto-block content. This is important for designing human-in-the-loop review workflows.
When comparing Rekognition Custom Labels vs. SageMaker: choose Custom Labels for minimal ML expertise, AutoML convenience, and small/medium labeled datasets. Choose SageMaker when you need custom model architectures, full hyperparameter control, large datasets, or integration with MLOps pipelines. The keyword 'no ML expertise' in exam scenarios points to Custom Labels.
The 5 MB (raw bytes) vs. 15 MB (S3 reference) image size limit is a frequently tested trap. Best practice: always store images in S3 and pass the S3 object reference to Rekognition APIs rather than embedding image bytes in the request payload.
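The S3-reference call shape looks like this in Python (client injected for testability); the bucket, key, and MinConfidence values are placeholders:

```python
def detect_labels_s3(rekognition, bucket, key, max_labels=10, min_conf=75.0):
    """Call DetectLabels with an S3 object reference instead of inline
    bytes, which raises the size ceiling from 5 MB to 15 MB and keeps
    large payloads out of the request."""
    resp = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
        MinConfidence=min_conf,
    )
    return [(l["Name"], l["Confidence"]) for l in resp["Labels"]]
```

The same `Image={"S3Object": ...}` shape works across the image APIs (DetectFaces, DetectModerationLabels, DetectText, etc.).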
Rekognition integrates with AWS IAM for access control. Face collections are regional resources — a collection created in us-east-1 cannot be queried from us-west-2. Architecture designs for global face search require either replication of face collections or routing to the correct region.
Common Mistake
Amazon Rekognition can detect PII (Personally Identifiable Information) in images and documents, making it a data security tool.
Correct
Rekognition detects VISUAL content — faces, objects, text pixels in images. It cannot semantically understand that a string of digits is a Social Security Number or that a name is PII. For PII detection in text, use Amazon Comprehend (DetectPiiEntities). For PII discovery in S3 data stores, use Amazon Macie.
Exam questions frequently present scenarios about 'detecting sensitive data' and include Rekognition as a distractor. The key signal: if the scenario involves text/data semantics → Comprehend/Macie. If it involves visual pixels → Rekognition.
Common Mistake
Rekognition's content moderation feature can detect prompt injection attacks or adversarial AI content, making it useful for AI safety guardrails.
Correct
Rekognition content moderation detects visual categories of unsafe content (nudity, violence, hate symbols, drugs) in images and video. It has no capability to analyze text semantics, detect malicious prompts, or identify AI-generated adversarial inputs. For prompt injection defense, use Amazon Bedrock Guardrails.
The AIF-C01 exam specifically tests AI safety tooling. Rekognition is a distractor in AI safety scenarios. Remember: Rekognition = visual safety (pixels). Bedrock Guardrails = AI safety (prompts and text).
Common Mistake
Rekognition Custom Labels works the same way as standard Rekognition APIs — pay-per-call, always available, no management needed.
Correct
Custom Labels requires you to explicitly START a model endpoint (StartProjectVersion) before inference and STOP it (StopProjectVersion) when done. You are billed per inference unit per hour while the model is running, regardless of whether you make API calls. Leaving a Custom Labels model running 24/7 with low traffic is a significant cost waste.
This is a critical cost optimization trap on SAA-C03 and SAP-C02. The exam will describe a scenario with an idle Custom Labels model and ask how to reduce costs — the answer is to stop the model when not in use, potentially using EventBridge + Lambda to schedule start/stop.
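The scheduled start/stop action can be sketched as a small helper that an EventBridge-triggered Lambda would call; the project-version ARN and inference-unit count are placeholders:

```python
def set_model_state(rekognition, project_version_arn, action, inference_units=1):
    """Start or stop a Custom Labels model endpoint. Billing accrues per
    inference unit per hour the whole time the model is RUNNING, whether
    or not anyone calls it."""
    if action == "start":
        rekognition.start_project_version(
            ProjectVersionArn=project_version_arn,
            MinInferenceUnits=inference_units,
        )
    elif action == "stop":
        rekognition.stop_project_version(ProjectVersionArn=project_version_arn)
    else:
        raise ValueError(f"unknown action: {action}")
    return action

# Usage from a Lambda handler (boto3 is preinstalled in Lambda):
# import boto3
# set_model_state(boto3.client("rekognition"),
#                 event["project_version_arn"], event["action"])
```

Two EventBridge schedules (one "start" before business hours, one "stop" after) invoking this Lambda is the pattern the exam expects.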
Common Mistake
Rekognition Video APIs return results synchronously, just like Rekognition Image APIs.
Correct
Rekognition Video APIs for stored video (StartLabelDetection, StartFaceDetection, etc.) are fully asynchronous. You submit a job, receive a JobId, and must either poll GetLabelDetection or configure an SNS topic for completion notification. Only Rekognition Image APIs are synchronous.
Architectures that try to call StartLabelDetection and immediately call GetLabelDetection will fail because the job won't be complete. The correct pattern uses SNS notification → SQS → Lambda → GetLabelDetection, or Step Functions with a wait state.
Common Mistake
Amazon Rekognition Streaming Video can detect any type of object or label in real-time from a live video feed.
Correct
Rekognition Streaming Video (via Kinesis Video Streams) ONLY supports face search against a pre-indexed face collection. Real-time label detection, content moderation, or PPE detection on live streams is NOT supported through the streaming API. These capabilities only work on stored video (async) or images (sync).
Exam scenarios about real-time surveillance often ask about detecting specific objects (weapons, vehicles) on live streams. Rekognition streaming cannot do this — the answer would involve SageMaker real-time inference with a custom model connected to KVS.
Common Mistake
Rekognition's DetectText API is equivalent to Amazon Textract and can extract structured data from forms and tables.
Correct
Rekognition DetectText detects and reads text that appears visually in an image (e.g., a street sign, a watermark, a license plate). It returns raw text strings and bounding boxes. It has NO understanding of document structure, form fields, table cells, or key-value pairs. Amazon Textract is purpose-built for structured document understanding with AnalyzeDocument (forms and tables).
Both services 'read text,' but the use cases are completely different. Rekognition DetectText = text in natural scenes. Textract = structured documents. Exam scenarios about processing invoices, tax forms, or medical records → Textract. Scenarios about reading text on physical objects in photos → Rekognition.
REKOGNITION = RETINA: It only sees what's in the PICTURE. No ears (audio → Transcribe), no brain for language (NLP → Comprehend), no document reader (layout → Textract). If it's not a pixel, it's not Rekognition's job.
Custom Labels = CUSTOM CHEF: You have to TURN ON the kitchen (StartProjectVersion) before cooking and TURN IT OFF when done (StopProjectVersion) — or you pay for a running kitchen with no customers.
Face COMPARE (1:1) = 'Is THIS you?' | Face SEARCH (1:N) = 'WHO are you?' — Compare two photos, Search a database.
Video = ASYNC always. Image = SYNC always. Remember: Videos take time to process; you can't hold the phone line open waiting.
CertAI Tutor · SAA-C03, SAP-C02, AIF-C01, CLF-C02 · 2026-02-22