
Turn any text into lifelike speech using deep learning — the AWS service that gives your applications a voice
Amazon Polly is a fully managed cloud service that uses advanced deep learning technologies to synthesize natural-sounding human speech from text. It supports dozens of languages and voices, offering both standard (concatenative) and neural (NTTS) text-to-speech engines, enabling developers to create applications that talk. Polly outputs audio in formats like MP3, OGG, and PCM, and supports SSML for fine-grained speech control.
Convert written text into natural-sounding audio for applications such as e-learning platforms, accessibility tools, voice-enabled apps, IVR systems, and content narration — without managing any speech infrastructure.
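The real-time path described above can be sketched with boto3. This is a minimal example under assumptions: the voice ("Joanna") and output filename are illustrative choices, and boto3 is imported lazily so the helper can be inspected without AWS credentials.

```python
"""Minimal real-time Polly TTS sketch (voice and filename are examples)."""

# SynthesizeSpeech handles short texts only; longer content must go
# through the asynchronous StartSpeechSynthesisTask API.
SYNC_CHAR_LIMIT = 3000

def needs_async(text: str, limit: int = SYNC_CHAR_LIMIT) -> bool:
    """Return True when the text is too long for real-time synthesis."""
    return len(text) > limit

def synthesize_to_file(text: str, path: str, voice: str = "Joanna") -> None:
    """Stream neural-engine audio for a short text straight into an MP3 file."""
    import boto3  # imported here so the module loads without AWS installed
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice,
        Engine="neural",  # or "standard" for the lower-cost engine
    )
    with open(path, "wb") as f:
        f.write(resp["AudioStream"].read())

# Usage (requires AWS credentials):
# synthesize_to_file("Hello from Amazon Polly.", "hello.mp3")
```

The `needs_async` helper encodes the 3,000-character decision point discussed later in this guide.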
Use When
Avoid When
Neural Text-to-Speech (NTTS)
Uses deep learning to produce significantly more natural, human-like voices than the standard engine; available for a subset of voices only
Standard (Concatenative) TTS Engine
Older engine with a wider voice selection and lower cost than NTTS, but less natural-sounding output
SSML (Speech Synthesis Markup Language)
Fine-grained control over speech: pauses, emphasis, pronunciation, speaking rate, pitch, volume, and more
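A sketch of SSML-controlled synthesis: the `build_ssml` helper and its tag values (rate, pause length) are illustrative, and `TextType="ssml"` is what tells Polly to parse the markup instead of speaking it literally.

```python
"""SSML synthesis sketch (helper name and tag values are examples)."""

def build_ssml(text: str, rate: str = "medium", pause_ms: int = 300) -> str:
    """Wrap plain text in prosody control plus a leading pause."""
    return (
        f'<speak><break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{text}</prosody></speak>'
    )

def speak_ssml(ssml: str, voice: str = "Joanna") -> bytes:
    """Synthesize SSML input; TextType='ssml' enables tag parsing."""
    import boto3
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId=voice
    )
    return resp["AudioStream"].read()
```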
Speech Marks
Returns JSON metadata with timing data for word, sentence, viseme (lip-sync), and SSML marks — useful for karaoke-style highlighting or animation
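Speech Marks responses arrive as newline-delimited JSON objects rather than audio. A minimal sketch, assuming that line-per-object format (the sample mark in the test is illustrative):

```python
"""Speech Marks sketch: request timing metadata, parse the JSON lines."""
import json

def parse_speech_marks(payload: bytes) -> list:
    """Each non-empty line of the response is one JSON mark object."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

def fetch_word_marks(text: str, voice: str = "Joanna") -> list:
    import boto3
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=text,
        OutputFormat="json",                 # speech marks, not audio
        SpeechMarkTypes=["word", "viseme"],  # timing + lip-sync metadata
        VoiceId=voice,
    )
    return parse_speech_marks(resp["AudioStream"].read())
```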
Custom Lexicons (PLS format)
Define custom pronunciations for domain-specific terms (medical, legal, brand names) using W3C PLS standard
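A minimal W3C PLS document and upload sketch. The lexeme content here is a made-up example; note the regional nature of lexicons means the `put_lexicon` call must be repeated per region.

```python
"""Custom lexicon sketch: a minimal PLS document (contents illustrative)."""

PLS_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

def upload_lexicon(name: str, content: str = PLS_DOC) -> None:
    """Lexicons are regional: repeat this call in every region you use."""
    import boto3
    boto3.client("polly").put_lexicon(Name=name, Content=content)
    # Apply later with: synthesize_speech(..., LexiconNames=[name])
```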
Asynchronous Synthesis (StartSpeechSynthesisTask)
For texts up to 100,000 characters; output delivered to an S3 bucket; poll with GetSpeechSynthesisTask
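The submit-then-poll flow can be sketched as below; the bucket name is a placeholder, and the polling interval is an arbitrary choice.

```python
"""Async synthesis sketch: submit a long text, poll until the task ends."""
import time

def task_finished(status: str) -> bool:
    """Polly task statuses: scheduled, inProgress, completed, failed."""
    return status in ("completed", "failed")

def synthesize_async(text: str, bucket: str, voice: str = "Joanna") -> str:
    import boto3
    polly = boto3.client("polly")
    task = polly.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice,
        OutputS3BucketName=bucket,  # finished MP3 lands in this bucket
    )["SynthesisTask"]
    while not task_finished(task["TaskStatus"]):
        time.sleep(5)
        task = polly.get_speech_synthesis_task(
            TaskId=task["TaskId"]
        )["SynthesisTask"]
    return task["OutputUri"]        # S3 URI of the synthesized audio
```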
Real-time Streaming Synthesis
SynthesizeSpeech returns an audio stream directly for low-latency applications
Newscaster Speaking Style
Available only with NTTS engine for select voices — sounds like professional broadcast journalism
Conversational Speaking Style
NTTS-only style; more casual, natural tone suited for virtual assistants and chatbots
Brand Voice (Custom voice)
AWS can create a unique voice trained on your brand's audio — requires working directly with AWS; not self-service
Multi-language support
Dozens of languages and regional variants; language selection is per-voice, not per-request
VPC Endpoint support (PrivateLink)
Polly can be accessed privately within a VPC without traversing the public internet
CloudTrail integration
All Polly API calls are logged in AWS CloudTrail for auditing and compliance
KMS encryption for async output
S3 output from StartSpeechSynthesisTask can be encrypted with a customer-managed KMS key
Full Conversational AI Voice Loop
High frequency · Lex handles natural language understanding and dialog management; Polly converts Lex's text responses into spoken audio. Together they create complete voice-enabled chatbots and virtual assistants. Lex can natively call Polly for voice output in some configurations.
Bidirectional Voice Pipeline
High frequency · Transcribe converts user speech to text (STT); Polly converts application responses back to speech (TTS). This pair is the foundation of any voice application. Critical distinction: Transcribe = speech IN, Polly = speech OUT.
Async Audio Content Generation
High frequency · StartSpeechSynthesisTask writes MP3/OGG files directly to a specified S3 bucket. S3 then serves as the distribution layer — files can be delivered via CloudFront for global low-latency audio streaming. Essential pattern for podcast/e-learning content generation pipelines.
Serverless On-Demand TTS
High frequency · Lambda functions invoke Polly's SynthesizeSpeech API in real time to generate audio on demand, streaming the result back to users or saving to S3. Enables event-driven TTS without managing servers.
IVR Voice Response System
Medium frequency · Amazon Connect contact flows use Polly voices (including NTTS) to speak dynamic, personalized messages to callers. Polly voices are selectable directly within Connect contact flow blocks.
Sentiment-Aware Speech
Medium frequency · Comprehend analyzes text sentiment; results inform SSML parameters passed to Polly (e.g., adjust speaking rate or emphasis based on detected emotion). Comprehend = text analysis; Polly = text output as audio.
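One way to wire this together is a lookup from Comprehend's sentiment labels to SSML prosody values. The mapping values below are illustrative, not a prescribed scheme:

```python
"""Hypothetical sentiment-to-prosody mapping: Comprehend's sentiment
label selects SSML prosody settings before the text goes to Polly."""

PROSODY_BY_SENTIMENT = {   # rate/pitch choices are illustrative
    "POSITIVE": ("fast", "+5%"),
    "NEGATIVE": ("slow", "-5%"),
    "NEUTRAL": ("medium", "+0%"),
    "MIXED": ("medium", "+0%"),
}

def sentiment_ssml(text: str, sentiment: str) -> str:
    """Wrap text in prosody tags chosen from the detected sentiment."""
    rate, pitch = PROSODY_BY_SENTIMENT.get(sentiment, ("medium", "+0%"))
    return (
        f'<speak><prosody rate="{rate}" pitch="{pitch}">'
        f"{text}</prosody></speak>"
    )
```

The resulting SSML string is then passed to SynthesizeSpeech with `TextType="ssml"`.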
Generative AI Voice Output
Medium frequency · Bedrock LLMs generate dynamic text responses; Polly converts them to speech. This creates voice-enabled generative AI applications without managing TTS infrastructure.
Document-to-Audio Pipeline
Medium frequency · Textract extracts text from scanned documents or PDFs; the extracted text is passed to Polly to generate audio versions. Useful for accessibility — turning printed documents into audio books.
Global Audio CDN Distribution
Low frequency · Pre-generated Polly audio files stored in S3 are distributed globally via CloudFront for low-latency audio delivery. Ideal for e-learning platforms with international users.
Polly is TEXT → SPEECH ONLY. It NEVER converts speech to text. If a question asks about converting audio/voice to text, the answer is Amazon Transcribe, not Polly. This directional confusion is the #1 trap.
For texts longer than 3,000 characters, you MUST use the asynchronous StartSpeechSynthesisTask API (supports up to 100,000 characters) with output delivered to S3. Real-time SynthesizeSpeech rejects requests beyond its 3,000-character limit rather than processing them.
Neural TTS (NTTS) produces more natural speech but costs ~4x more than standard voices. On cost-optimization questions, standard voices are preferred for high-volume, non-customer-facing use cases. NTTS is preferred for customer-facing applications where voice quality matters.
The complete voice assistant architecture is: Transcribe (user speech → text) → Lex (understand intent) → Lambda (business logic) → Polly (response text → speech). Know this pipeline cold for SAA-C03 and SAP-C02.
Polly = Text → Speech ONLY. ANY question about converting audio/speech to text = Amazon Transcribe. This directional confusion is the most common reason candidates lose Polly-related points.
For content longer than 3,000 characters, ALWAYS use asynchronous StartSpeechSynthesisTask with S3 output — real-time SynthesizeSpeech cannot handle it. This is the most tested architectural decision for Polly.
The complete voice assistant stack: Transcribe (STT) + Lex (NLU) + Lambda (logic) + Polly (TTS). Each service has ONE job. Polly ONLY speaks the final response — it understands nothing.
SSML tags are NOT billed as characters but give you powerful control: use <break> for pauses, <emphasis> for stress, <prosody> for rate/pitch/volume, and <phoneme> for pronunciation. This is tested in architect-level questions about customizing voice output.
Speech Marks output format is JSON — it does NOT produce audio. It produces metadata (timing, word boundaries, visemes for lip-sync). If a question asks about synchronizing text highlights with audio playback, Speech Marks is the answer.
Custom Lexicons use the W3C PLS (Pronunciation Lexicon Specification) format and are REGIONAL — you must upload them to each AWS region where you use Polly. Up to 100 lexicons per region, up to 5 applied per request.
Amazon Polly integrates natively with Amazon Connect for IVR systems. When a question describes a call center needing dynamic voice messages without pre-recording, the pattern is Connect + Polly (NTTS for most natural sound).
For the AIF-C01 exam: Polly is a pre-trained AI service (no model training required by the user). It sits in the 'AI Services' layer of AWS AI/ML stack — above SageMaker (build your own) and Bedrock (foundation models). You consume it via API without ML expertise.
Polly supports VPC Endpoints (PrivateLink) — for compliance scenarios where audio synthesis must not traverse the public internet, this is the correct architectural choice. Pair with KMS encryption for async S3 output.
Common Mistake
Amazon Polly can transcribe speech/audio into text
Correct
Polly is exclusively TEXT-TO-SPEECH (TTS). It converts written text into spoken audio. The reverse operation — converting spoken audio to text — is performed by Amazon Transcribe.
This is the #1 Polly misconception on all certification exams. The names don't inherently signal direction. Memory trick: 'Polly SPEAKS' (output), 'Transcribe LISTENS' (input). On exams, any mention of converting audio/voice recordings to text = Transcribe, not Polly.
Common Mistake
Amazon Polly and Amazon Lex are interchangeable for building voice assistants
Correct
They serve completely different roles: Lex handles UNDERSTANDING (NLU, intent detection, dialog management). Polly handles OUTPUT (converting text responses to speech). A complete voice assistant needs BOTH — plus Transcribe for voice input.
Candidates confuse conversational AI (Lex) with speech synthesis (Polly). Lex cannot produce audio; Polly cannot understand language. On exams, if the requirement is 'understand what the user said,' the answer involves Lex. If the requirement is 'speak a response,' the answer involves Polly.
Common Mistake
Amazon Comprehend can be used instead of Polly for text-to-speech or audio generation
Correct
Amazon Comprehend is a Natural Language Processing (NLP) service that ANALYZES text — it extracts sentiment, entities, key phrases, and language. It produces structured data, not audio. Polly produces audio from text.
This misconception appears in questions about AI service selection. Comprehend = text analysis (input: text, output: insights). Polly = text synthesis (input: text, output: audio). They can be COMBINED (Comprehend analyzes sentiment → Polly speaks with appropriate tone via SSML) but are not substitutes.
Common Mistake
SSML tags count toward the billed character limit
Correct
SSML markup tags themselves (e.g., <break/>, <emphasis>, </prosody>) are NOT billed. Only the actual text content (the words being spoken) counts toward character billing.
This matters for cost estimation. A heavily SSML-annotated document with 2,000 words of text and 500 characters of SSML tags is billed for the 2,000 words only. Misunderstanding this leads to overestimating costs in exam cost-optimization scenarios.
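A rough billed-character estimator follows from this rule: strip the markup, count what remains. This is a simplification for illustration (the regex and helper name are made up); Polly's actual metering is authoritative.

```python
"""Rough billed-character estimator: SSML tags are free, text is billed."""
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches any SSML tag, e.g. <break time="1s"/>

def billed_chars(ssml: str) -> int:
    """Only the spoken text counts toward billing, not the markup."""
    return len(TAG_RE.sub("", ssml))
```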
Common Mistake
You can use a single Polly lexicon globally across all regions
Correct
Lexicons are REGIONAL resources. If you use Polly in us-east-1 and eu-west-1, you must upload your custom lexicons separately to each region. There is no global lexicon store.
This is an architectural trap in multi-region deployment questions. Failing to account for regional lexicon replication in a deployment pipeline will result in inconsistent pronunciation in non-primary regions.
Common Mistake
Neural TTS (NTTS) is always better and should always be chosen over Standard voices
Correct
NTTS produces more natural speech but costs approximately 4x more per character. For high-volume, non-customer-facing use cases (e.g., internal notifications, batch audio generation), Standard voices are the cost-optimal choice. NTTS is best for customer-facing applications where voice quality directly impacts user experience.
Cost-optimization questions will present scenarios where NTTS is technically superior but Standard is the correct answer because of cost constraints. Always evaluate the use case context — internal vs. customer-facing, volume, and budget.
Common Mistake
Amazon Polly requires machine learning expertise to use
Correct
Polly is a fully managed, pre-trained AI service accessible via API. No ML knowledge, model training, or data science expertise is required. You call the API with text and receive audio — the deep learning is entirely abstracted away.
On the AIF-C01 exam, understanding the distinction between AI Services (Polly, Rekognition, Transcribe — pre-trained, API-only), ML Platforms (SageMaker — build your own), and Foundation Model services (Bedrock) is critical. Polly sits firmly in the 'no ML expertise required' category.
POLLY SPEAKS, TRANSCRIBE LISTENS — Polly = Parrot (speaks text back), Transcribe = Secretary (writes down what it hears)
The Voice Pipeline: T-L-P (Transcribe → Lex → Polly) = 'TaLk to Polly' — user Talks (Transcribe), system Listens/thinks (Lex), Polly responds
3K sync, 100K async — 'Three thousand for fast, a hundred thousand for VAST (async tasks)'
LEX = Language EXpert (understands), POLLY = POLite LY speaker (outputs) — they're a team, not competitors
SSML = Speech Synthesis Markup Language — tags are FREE, text is BILLED
CertAI Tutor · SAA-C03, SAP-C02, AIF-C01, CLF-C02 · 2026-02-22