AWS Polly API
AWS Polly converts text to lifelike speech with 60+ voices across 30+ languages using standard and neural TTS engines. It outputs MP3, OGG, or PCM audio in synchronous or asynchronous mode.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
IAM role-based access. Polly does not retain input text after synthesis, so it stores no PII. TLS in transit. HIPAA eligible (spoken health content requires a BAA with AWS). FedRAMP authorized.
⚡ Reliability
Best When
You're already on AWS and need reliable, affordable TTS with SSML support, good language coverage, and async large-batch audio generation.
Avoid When
You need ultra-realistic voice cloning, need real-time streaming TTS with under 100 ms latency, or are not on AWS infrastructure.
Use Cases
- Generating audio narration for agent-produced reports and content
- Text-to-speech for IVR (interactive voice response) systems integrated with Amazon Connect
- Creating voice responses for agent chatbots deployed via Alexa or Lex
- Batch audio generation for e-learning content from text-based course materials
- Real-time speech synthesis for accessibility features in agent-powered applications
Not For
- Ultra-low latency TTS for real-time conversational AI (Cartesia or ElevenLabs are faster)
- Voice cloning or custom voice training (Polly uses fixed voices, no custom models)
- Teams not on AWS who need simpler API access (OpenAI TTS or ElevenLabs are easier)
Interface
Authentication
AWS SigV4 signing. IAM policies control the SynthesizeSpeech and StartSpeechSynthesisTask actions. Async jobs write their output to S3, so the caller needs S3 write permission on the output bucket in addition to Polly permissions.
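The permissions described above can be sketched as an IAM policy document. This is a minimal illustration, not a vetted production policy: the helper name `build_polly_policy`, the bucket name, and the key prefix are hypothetical, and `polly:GetSpeechSynthesisTask` is included on the assumption that the agent will check the status of its async jobs.

```python
import json


def build_polly_policy(bucket, prefix=""):
    """Return an IAM policy dict allowing sync and async Polly synthesis.

    `bucket` and `prefix` are placeholders for the S3 location that
    async tasks write to; tighten them to match your own setup.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Synchronous synthesis, plus starting and checking async tasks.
                "Effect": "Allow",
                "Action": [
                    "polly:SynthesizeSpeech",
                    "polly:StartSpeechSynthesisTask",
                    "polly:GetSpeechSynthesisTask",
                ],
                "Resource": "*",
            },
            {
                # Async tasks write their audio into this bucket/prefix.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(build_polly_policy("my-audio-bucket", "tts/"), indent=2))
```

Scoping the S3 statement to a single prefix keeps an agent from writing elsewhere in the bucket.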
Pricing
Pricing is per character. Neural TTS costs 4x the standard rate but sounds significantly better. Long Form neural voices are priced separately.
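The 4x relationship above can be turned into a rough cost estimator. In this sketch, `estimate_cost` and its parameters are illustrative names, and the standard rate is supplied by the caller rather than hard-coded, since current prices should be read from the AWS pricing page. Engines other than standard and neural are rejected because the listing says only that they are priced separately.

```python
NEURAL_MULTIPLIER = 4  # taken from the pricing note above


def estimate_cost(billed_chars, standard_rate_per_million, engine="standard"):
    """Estimate synthesis cost for a number of billed characters.

    `standard_rate_per_million` is the caller-supplied price per one
    million characters for the standard engine (an input, not a quoted
    AWS price).
    """
    if engine == "standard":
        rate = standard_rate_per_million
    elif engine == "neural":
        rate = standard_rate_per_million * NEURAL_MULTIPLIER
    else:
        # Long Form and other engines have their own price lists.
        raise ValueError(f"no rate known for engine {engine!r}")
    return billed_chars / 1_000_000 * rate


if __name__ == "__main__":
    # 4.0 is a placeholder rate used only to show the arithmetic.
    print(estimate_cost(500_000, 4.0, engine="neural"))
```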
Agent Metadata
Known Gotchas
- ⚠ SynthesizeSpeech accepts at most 3,000 billed characters per request (6,000 total characters including SSML tags), so agents must split longer text at natural break points
- ⚠ SSML tags are not billed, but they do count toward the 6,000-character total limit, so track billed and total lengths separately when mixing SSML and text
- ⚠ Async StartSpeechSynthesisTask saves its output to S3, so agents need both Polly and S3 write permissions and must poll GetSpeechSynthesisTask (or subscribe to an SNS notification) to detect completion
- ⚠ Neural voices are region-specific: not every neural voice is available in every AWS region
- ⚠ Response audio is returned as a stream in the HTTP response body, not as a file or URL, so agents must read the stream themselves (in chunks or in full) before playing or saving it
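Three of the gotchas above (the per-request character limit, polling for async completion, and reading the audio stream) can be sketched as small helpers. This is a minimal sketch under stated assumptions: `split_text`, `save_audio`, and `wait_for_task` are hypothetical names, `split_text` handles plain text only (SSML would need tag-aware splitting), and `polly` is assumed to be a boto3 Polly client or any object exposing the same `get_speech_synthesis_task` call.

```python
import re
import time

MAX_BILLED_CHARS = 3000  # SynthesizeSpeech billed-character limit noted above


def split_text(text, limit=MAX_BILLED_CHARS):
    """Split plain text into chunks of at most `limit` characters,
    preferring sentence boundaries, then word boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is broken on spaces.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit  # no usable space: fall back to a hard cut
            piece, sentence = sentence[:cut], sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def save_audio(stream, path, chunk_size=4096):
    """Copy a file-like audio stream to disk in chunks; return bytes written."""
    written = 0
    with open(path, "wb") as out:
        while True:
            block = stream.read(chunk_size)
            if not block:
                break
            out.write(block)
            written += len(block)
    return written


def wait_for_task(polly, task_id, interval=5.0, timeout=600.0):
    """Poll an async synthesis task until it completes; return its output URI."""
    deadline = time.monotonic() + timeout
    while True:
        task = polly.get_speech_synthesis_task(TaskId=task_id)["SynthesisTask"]
        status = task["TaskStatus"]
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        time.sleep(interval)
```

With boto3 installed and credentials configured, each chunk from `split_text` could be passed to `polly.synthesize_speech(Text=chunk, OutputFormat="mp3", VoiceId="Joanna")`, and the `AudioStream` value of the response handed to `save_audio`. The voice and output format here are example choices, not recommendations.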
Alternatives
Full Evaluation Report
Detailed scoring breakdown, competitive positioning, security analysis, and improvement recommendations for AWS Polly API.
Scores are editorial opinions as of 2026-03-06.