
Turn any text into lifelike speech using deep learning — the AWS service that gives your applications a voice
Amazon Polly is a fully managed cloud service that uses advanced deep learning technologies to synthesize natural-sounding human speech from text. It supports dozens of languages and voices, offering both standard (concatenative) and neural (NTTS) text-to-speech engines, enabling developers to create applications that talk. Polly outputs audio in formats like MP3, OGG, and PCM, and supports SSML for fine-grained speech control.
Convert written text into natural-sounding audio for applications such as e-learning platforms, accessibility tools, voice-enabled apps, IVR systems, and content narration — without managing any speech infrastructure.
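The real-time path described above can be sketched with boto3. This is a minimal example under assumptions: the voice ("Joanna") and output filename are illustrative choices, and boto3 is imported lazily so the helper can be inspected without AWS credentials.

```python
"""Minimal real-time Polly TTS sketch (voice and filename are examples)."""

# SynthesizeSpeech handles short texts only; longer content must go
# through the asynchronous StartSpeechSynthesisTask API.
SYNC_CHAR_LIMIT = 3000

def needs_async(text: str, limit: int = SYNC_CHAR_LIMIT) -> bool:
    """Return True when the text is too long for real-time synthesis."""
    return len(text) > limit

def synthesize_to_file(text: str, path: str, voice: str = "Joanna") -> None:
    """Stream neural-engine audio for a short text straight into an MP3 file."""
    import boto3  # imported here so the module loads without AWS installed
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice,
        Engine="neural",  # or "standard" for the lower-cost engine
    )
    with open(path, "wb") as f:
        f.write(resp["AudioStream"].read())

# Usage (requires AWS credentials):
# synthesize_to_file("Hello from Amazon Polly.", "hello.mp3")
```

The `needs_async` helper encodes the 3,000-character decision point discussed later in this guide.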
Use When
Avoid When
Neural Text-to-Speech (NTTS)
Uses deep learning to produce significantly more natural, human-like voices than the standard engine; available for a subset of voices only
Standard (Concatenative) TTS Engine
Older engine with a wider voice selection and lower cost than NTTS, but less natural-sounding output
SSML (Speech Synthesis Markup Language)
Fine-grained control over speech: pauses, emphasis, pronunciation, speaking rate, pitch, volume, and more
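A sketch of SSML-controlled synthesis: the `build_ssml` helper and its tag values (rate, pause length) are illustrative, and `TextType="ssml"` is what tells Polly to parse the markup instead of speaking it literally.

```python
"""SSML synthesis sketch (helper name and tag values are examples)."""

def build_ssml(text: str, rate: str = "medium", pause_ms: int = 300) -> str:
    """Wrap plain text in prosody control plus a leading pause."""
    return (
        f'<speak><break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{text}</prosody></speak>'
    )

def speak_ssml(ssml: str, voice: str = "Joanna") -> bytes:
    """Synthesize SSML input; TextType='ssml' enables tag parsing."""
    import boto3
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId=voice
    )
    return resp["AudioStream"].read()
```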
Speech Marks
Returns JSON metadata with timing data for word, sentence, viseme (lip-sync), and SSML marks — useful for karaoke-style highlighting or animation
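Speech Marks responses arrive as newline-delimited JSON objects rather than audio. A minimal sketch, assuming that line-per-object format (the sample mark in the test is illustrative):

```python
"""Speech Marks sketch: request timing metadata, parse the JSON lines."""
import json

def parse_speech_marks(payload: bytes) -> list:
    """Each non-empty line of the response is one JSON mark object."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

def fetch_word_marks(text: str, voice: str = "Joanna") -> list:
    import boto3
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=text,
        OutputFormat="json",                 # speech marks, not audio
        SpeechMarkTypes=["word", "viseme"],  # timing + lip-sync metadata
        VoiceId=voice,
    )
    return parse_speech_marks(resp["AudioStream"].read())
```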
Custom Lexicons (PLS format)
Define custom pronunciations for domain-specific terms (medical, legal, brand names) using W3C PLS standard
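A minimal W3C PLS document and upload sketch. The lexeme content here is a made-up example; note the regional nature of lexicons means the `put_lexicon` call must be repeated per region.

```python
"""Custom lexicon sketch: a minimal PLS document (contents illustrative)."""

PLS_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

def upload_lexicon(name: str, content: str = PLS_DOC) -> None:
    """Lexicons are regional: repeat this call in every region you use."""
    import boto3
    boto3.client("polly").put_lexicon(Name=name, Content=content)
    # Apply later with: synthesize_speech(..., LexiconNames=[name])
```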
Asynchronous Synthesis (StartSpeechSynthesisTask)
For texts up to 100,000 characters; output delivered to an S3 bucket; poll with GetSpeechSynthesisTask
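The submit-then-poll flow can be sketched as below; the bucket name is a placeholder, and the polling interval is an arbitrary choice.

```python
"""Async synthesis sketch: submit a long text, poll until the task ends."""
import time

def task_finished(status: str) -> bool:
    """Polly task statuses: scheduled, inProgress, completed, failed."""
    return status in ("completed", "failed")

def synthesize_async(text: str, bucket: str, voice: str = "Joanna") -> str:
    import boto3
    polly = boto3.client("polly")
    task = polly.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice,
        OutputS3BucketName=bucket,  # finished MP3 lands in this bucket
    )["SynthesisTask"]
    while not task_finished(task["TaskStatus"]):
        time.sleep(5)
        task = polly.get_speech_synthesis_task(
            TaskId=task["TaskId"]
        )["SynthesisTask"]
    return task["OutputUri"]        # S3 URI of the synthesized audio
```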
Real-time Streaming Synthesis
SynthesizeSpeech returns an audio stream directly for low-latency applications
Newscaster Speaking Style
Available only with NTTS engine for select voices — sounds like professional broadcast journalism
Conversational Speaking Style
NTTS-only style; more casual, natural tone suited for virtual assistants and chatbots
Brand Voice (Custom voice)
AWS can create a unique voice trained on your brand's audio — requires working directly with AWS; not self-service
Multi-language support
Dozens of languages and regional variants; language selection is per-voice, not per-request
VPC Endpoint support (PrivateLink)
Polly can be accessed privately within a VPC without traversing the public internet
CloudTrail integration
All Polly API calls are logged in AWS CloudTrail for auditing and compliance
KMS encryption for async output
S3 output from StartSpeechSynthesisTask can be encrypted with a customer-managed KMS key
Full Conversational AI Voice Loop
High frequency · Lex handles natural language understanding and dialog management; Polly converts Lex's text responses into spoken audio. Together they create complete voice-enabled chatbots and virtual assistants. Lex can natively call Polly for voice output in some configurations.
Bidirectional Voice Pipeline
High frequency · Transcribe converts user speech to text (STT); Polly converts application responses back to speech (TTS). This pair is the foundation of any voice application. Critical distinction: Transcribe = speech IN, Polly = speech OUT.
Async Audio Content Generation
High frequency · StartSpeechSynthesisTask writes MP3/OGG files directly to a specified S3 bucket. S3 then serves as the distribution layer — files can be delivered via CloudFront for global low-latency audio streaming. Essential pattern for podcast/e-learning content generation pipelines.
Serverless On-Demand TTS
High frequency · Lambda functions invoke Polly's SynthesizeSpeech API in real time to generate audio on demand, streaming the result back to users or saving to S3. Enables event-driven TTS without managing servers.
IVR Voice Response System
Medium frequency · Amazon Connect contact flows use Polly voices (including NTTS) to speak dynamic, personalized messages to callers. Polly voices are selectable directly within Connect contact flow blocks.
Sentiment-Aware Speech
Medium frequency · Comprehend analyzes text sentiment; results inform SSML parameters passed to Polly (e.g., adjust speaking rate or emphasis based on detected emotion). Comprehend = text analysis; Polly = text output as audio.
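One way to wire this together is a lookup from Comprehend's sentiment labels to SSML prosody values. The mapping values below are illustrative, not a prescribed scheme:

```python
"""Hypothetical sentiment-to-prosody mapping: Comprehend's sentiment
label selects SSML prosody settings before the text goes to Polly."""

PROSODY_BY_SENTIMENT = {   # rate/pitch choices are illustrative
    "POSITIVE": ("fast", "+5%"),
    "NEGATIVE": ("slow", "-5%"),
    "NEUTRAL": ("medium", "+0%"),
    "MIXED": ("medium", "+0%"),
}

def sentiment_ssml(text: str, sentiment: str) -> str:
    """Wrap text in prosody tags chosen from the detected sentiment."""
    rate, pitch = PROSODY_BY_SENTIMENT.get(sentiment, ("medium", "+0%"))
    return (
        f'<speak><prosody rate="{rate}" pitch="{pitch}">'
        f"{text}</prosody></speak>"
    )
```

The resulting SSML string is then passed to SynthesizeSpeech with `TextType="ssml"`.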
Generative AI Voice Output
Medium frequency · Bedrock LLMs generate dynamic text responses; Polly converts them to speech. This creates voice-enabled generative AI applications without managing TTS infrastructure.
Document-to-Audio Pipeline
Medium frequency · Textract extracts text from scanned documents or PDFs; the extracted text is passed to Polly to generate audio versions. Useful for accessibility — turning printed documents into audio books.
Global Audio CDN Distribution
Low frequency · Pre-generated Polly audio files stored in S3 are distributed globally via CloudFront for low-latency audio delivery. Ideal for e-learning platforms with international users.
Polly is TEXT → SPEECH ONLY. It NEVER converts speech to text. If a question asks about converting audio/voice to text, the answer is Amazon Transcribe, not Polly. This directional confusion is the #1 trap.
For texts longer than 3,000 characters, you MUST use the asynchronous StartSpeechSynthesisTask API (supports up to 100,000 characters) with output delivered to S3. Real-time SynthesizeSpeech rejects requests beyond its 3,000-character limit rather than processing them.
Neural TTS (NTTS) produces more natural speech but costs ~4x more than standard voices. On cost-optimization questions, standard voices are preferred for high-volume, non-customer-facing use cases. NTTS is preferred for customer-facing applications where voice quality matters.
The complete voice assistant architecture is: Transcribe (user speech → text) → Lex (understand intent) → Lambda (business logic) → Polly (response text → speech). Know this pipeline cold for SAA-C03 and SAP-C02.
Polly = Text → Speech ONLY. ANY question about converting audio/speech to text = Amazon Transcribe. This directional confusion is the most common reason candidates lose Polly-related points.
For content longer than 3,000 characters, ALWAYS use asynchronous StartSpeechSynthesisTask with S3 output — real-time SynthesizeSpeech cannot handle it. This is the most tested architectural decision for Polly.
The complete voice assistant stack: Transcribe (STT) + Lex (NLU) + Lambda (logic) + Polly (TTS). Each service has ONE job. Polly ONLY speaks the final response — it understands nothing.
SSML tags are NOT billed as characters but give you powerful control: use <break> for pauses, <emphasis> for stress, <prosody> for rate/pitch/volume, and <phoneme> for pronunciation. This is tested in architect-level questions about customizing voice output.
Speech Marks output format is JSON — it does NOT produce audio. It produces metadata (timing, word boundaries, visemes for lip-sync). If a question asks about synchronizing text highlights with audio playback, Speech Marks is the answer.
Custom Lexicons use the W3C PLS (Pronunciation Lexicon Specification) format and are REGIONAL — you must upload them to each AWS region where you use Polly. Up to 100 lexicons per region, up to 5 applied per request.
Amazon Polly integrates natively with Amazon Connect for IVR systems. When a question describes a call center needing dynamic voice messages without pre-recording, the pattern is Connect + Polly (NTTS for most natural sound).
For the AIF-C01 exam: Polly is a pre-trained AI service (no model training required by the user). It sits in the 'AI Services' layer of AWS AI/ML stack — above SageMaker (build your own) and Bedrock (foundation models). You consume it via API without ML expertise.
Polly supports VPC Endpoints (PrivateLink) — for compliance scenarios where audio synthesis must not traverse the public internet, this is the correct architectural choice. Pair with KMS encryption for async S3 output.
Common Mistake
Amazon Polly can transcribe speech/audio into text
Correct
Polly is exclusively TEXT-TO-SPEECH (TTS). It converts written text into spoken audio. The reverse operation — converting spoken audio to text — is performed by Amazon Transcribe.
This is the #1 Polly misconception on all certification exams. The names don't inherently signal direction. Memory trick: 'Polly SPEAKS' (output), 'Transcribe LISTENS' (input). On exams, any mention of converting audio/voice recordings to text = Transcribe, not Polly.
Common Mistake
Amazon Polly and Amazon Lex are interchangeable for building voice assistants
Correct
They serve completely different roles: Lex handles UNDERSTANDING (NLU, intent detection, dialog management). Polly handles OUTPUT (converting text responses to speech). A complete voice assistant needs BOTH — plus Transcribe for voice input.
Candidates confuse conversational AI (Lex) with speech synthesis (Polly). Lex cannot produce audio; Polly cannot understand language. On exams, if the requirement is 'understand what the user said,' the answer involves Lex. If the requirement is 'speak a response,' the answer involves Polly.
Common Mistake
Amazon Comprehend can be used instead of Polly for text-to-speech or audio generation
Correct
Amazon Comprehend is a Natural Language Processing (NLP) service that ANALYZES text — it extracts sentiment, entities, key phrases, and language. It produces structured data, not audio. Polly produces audio from text.
This misconception appears in questions about AI service selection. Comprehend = text analysis (input: text, output: insights). Polly = text synthesis (input: text, output: audio). They can be COMBINED (Comprehend analyzes sentiment → Polly speaks with appropriate tone via SSML) but are not substitutes.
Common Mistake
SSML tags count toward the billed character limit
Correct
SSML markup tags themselves (e.g., <break/>, <emphasis>, </prosody>) are NOT billed. Only the actual text content (the words being spoken) counts toward character billing.
This matters for cost estimation. A heavily SSML-annotated document with 2,000 words of text and 500 characters of SSML tags is billed for the 2,000 words only. Misunderstanding this leads to overestimating costs in exam cost-optimization scenarios.
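A rough billed-character estimator follows from this rule: strip the markup, count what remains. This is a simplification for illustration (the regex and helper name are made up); Polly's actual metering is authoritative.

```python
"""Rough billed-character estimator: SSML tags are free, text is billed."""
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches any SSML tag, e.g. <break time="1s"/>

def billed_chars(ssml: str) -> int:
    """Only the spoken text counts toward billing, not the markup."""
    return len(TAG_RE.sub("", ssml))
```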
Common Mistake
You can use a single Polly lexicon globally across all regions
Correct
Lexicons are REGIONAL resources. If you use Polly in us-east-1 and eu-west-1, you must upload your custom lexicons separately to each region. There is no global lexicon store.
This is an architectural trap in multi-region deployment questions. Failing to account for regional lexicon replication in a deployment pipeline will result in inconsistent pronunciation in non-primary regions.
Common Mistake
Neural TTS (NTTS) is always better and should always be chosen over Standard voices
Correct
NTTS produces more natural speech but costs approximately 4x more per character. For high-volume, non-customer-facing use cases (e.g., internal notifications, batch audio generation), Standard voices are the cost-optimal choice. NTTS is best for customer-facing applications where voice quality directly impacts user experience.
Cost-optimization questions will present scenarios where NTTS is technically superior but Standard is the correct answer because of cost constraints. Always evaluate the use case context — internal vs. customer-facing, volume, and budget.
Common Mistake
Amazon Polly requires machine learning expertise to use
Correct
Polly is a fully managed, pre-trained AI service accessible via API. No ML knowledge, model training, or data science expertise is required. You call the API with text and receive audio — the deep learning is entirely abstracted away.
On the AIF-C01 exam, understanding the distinction between AI Services (Polly, Rekognition, Transcribe — pre-trained, API-only), ML Platforms (SageMaker — build your own), and Foundation Model services (Bedrock) is critical. Polly sits firmly in the 'no ML expertise required' category.
POLLY SPEAKS, TRANSCRIBE LISTENS — Polly = Parrot (speaks text back), Transcribe = Secretary (writes down what it hears)
The Voice Pipeline: T-L-P (Transcribe → Lex → Polly) = 'TaLk to Polly' — user Talks (Transcribe), system Listens/thinks (Lex), Polly responds
3K sync, 100K async — 'Three thousand for fast, a hundred thousand for VAST (async tasks)'
LEX = Language EXpert (understands), POLLY = POLite LY speaker (outputs) — they're a team, not competitors
SSML = Speech Synthesis Markup Language — tags are FREE, text is BILLED
CertAI Tutor · SAA-C03, SAP-C02, AIF-C01, CLF-C02 · 2026-02-22