TTS — text to speech
Anyreach supports three TTS providers.
Cartesia Sonic-3
The default and recommended choice for most use cases.
| Field | Default | Range |
|---|---|---|
| Voice ID | (provider-specific) | 100+ voices, 42 languages |
| Speed | 1.0 | 0.6 – 1.5 |
| Volume | 1.0 | 0.5 – 2.0 |
| Emotion | neutral | neutral, happy, sad, angry, surprised, curious |
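The ranges in the table above lend themselves to a small validation step before settings are sent to the provider. The following is a minimal sketch only: `CartesiaSettings` and its field names are hypothetical and do not reflect Anyreach's actual configuration schema; only the defaults, ranges, and emotion values come from the table.

```python
from dataclasses import dataclass

# Hypothetical names: Anyreach's real config schema is not shown in this document.
# The allowed emotions, defaults, and numeric ranges are taken from the table above.
EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised", "curious"}


@dataclass(frozen=True)
class CartesiaSettings:
    voice_id: str
    speed: float = 1.0       # documented range: 0.6 to 1.5
    volume: float = 1.0      # documented range: 0.5 to 2.0
    emotion: str = "neutral"

    def __post_init__(self) -> None:
        # Reject out-of-range values early instead of letting the provider fail.
        if not 0.6 <= self.speed <= 1.5:
            raise ValueError(f"speed {self.speed} is outside 0.6-1.5")
        if not 0.5 <= self.volume <= 2.0:
            raise ValueError(f"volume {self.volume} is outside 0.5-2.0")
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion {self.emotion!r}")
```

Constructing `CartesiaSettings(voice_id="example-voice")` yields the documented defaults, while `speed=2.0` raises a `ValueError`.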
ElevenLabs
Best for premium-sounding voices and voice cloning. Models: eleven_turbo_v2, eleven_turbo_v2_5
| Setting | Default |
|---|---|
| Stability | 0.5 |
| Similarity boost | 0.75 |
| Speed | 1.0 |
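One way to work with these defaults is to overlay per-agent overrides on top of them. This is a sketch under assumptions: the function name and the dictionary keys (`stability`, `similarity_boost`, `speed`) are illustrative, not a confirmed Anyreach or ElevenLabs schema; the default values and model names are the ones listed above.

```python
# Defaults and model names come from the section above; the key names and the
# helper itself are hypothetical.
ELEVENLABS_MODELS = ("eleven_turbo_v2", "eleven_turbo_v2_5")
ELEVENLABS_DEFAULTS = {"stability": 0.5, "similarity_boost": 0.75, "speed": 1.0}


def elevenlabs_settings(model: str, **overrides: float) -> dict:
    """Return the documented defaults with any overrides applied on top."""
    if model not in ELEVENLABS_MODELS:
        raise ValueError(f"unsupported model {model!r}")
    unknown = set(overrides) - set(ELEVENLABS_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {"model": model, **ELEVENLABS_DEFAULTS, **overrides}
```

For example, `elevenlabs_settings("eleven_turbo_v2_5", stability=0.3)` changes only the stability value and leaves the other defaults in place.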
LiveKit Inference
Pass-through to any TTS available through LiveKit’s inference plane. Specify the model string explicitly. Useful if you have a contract with a provider not natively supported.
STT — speech to text
| Provider | Model | Notes |
|---|---|---|
| Deepgram | nova-2-general, nova-3-general | Default. Robust, low latency, English-strong |
| Gladia | solaria-1 | Strong multilingual + code-switching |
| Anyreach Default | (auto) | Resolves to a provider at runtime based on language |
| LiveKit Inference | custom string | Bring-your-own |
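The document does not spell out how the default option resolves to a provider, so the following is only an illustration of what language-based resolution could look like. The rule (English goes to Deepgram, everything else and multilingual calls go to Gladia) is an assumption inferred from the Notes column, and `resolve_stt` is a hypothetical helper, not part of any Anyreach API.

```python
def resolve_stt(language: str, multilingual: bool = False) -> tuple[str, str]:
    """Return a (provider, model) pair for a language tag such as "en-US".

    Assumed rule, not Anyreach's actual logic: English-only calls use
    Deepgram; multilingual or non-English calls use Gladia.
    """
    if multilingual or not language.lower().startswith("en"):
        return ("gladia", "solaria-1")
    return ("deepgram", "nova-3-general")
```

Under this assumed rule, `resolve_stt("en-US")` selects Deepgram and `resolve_stt("es")` selects Gladia.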
LLM — the model that drives the conversation
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4.1 |
| Azure OpenAI | gpt-4o, gpt-4.1 |
| Google Gemini 2.5 | flash, flash-lite, pro (with thinking_budget, top_p/top_k) |
| Cerebras | llama-3.3-70b, zai-glm-4.7 |
| Vercel AI Gateway | any model slug, e.g. anthropic/claude-sonnet-4 |
| LiveKit Inference | custom model string |
How to choose
- Default: gpt-4o (Azure), the best quality-per-latency trade-off for most use cases
- Lowest latency: Cerebras llama-3.3-70b, with sub-second time to first token; good for high call volumes where latency dominates UX
- Highest quality reasoning: gpt-4.1; pick it when the agent needs to follow complex multi-step instructions or reason carefully about KB content
- Compliance: prefer Azure OpenAI variants if you have data-residency requirements
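The guidance above can be written down as a small decision function. This is a sketch, not a platform feature: `Priority`, `pick_llm`, and the provider identifiers are hypothetical, and two choices are assumptions beyond what the list states, namely that data residency overrides the low-latency pick and that gpt-4.1 runs on OpenAI unless residency applies.

```python
from enum import Enum


class Priority(Enum):
    BALANCED = "balanced"
    LOW_LATENCY = "low_latency"
    REASONING = "reasoning"


def pick_llm(priority: Priority, data_residency: bool = False) -> tuple[str, str]:
    """Return a (provider, model) pair following the selection guidance."""
    if priority is Priority.LOW_LATENCY and not data_residency:
        return ("cerebras", "llama-3.3-70b")
    if priority is Priority.REASONING:
        # Assumption: use the Azure variant only when residency is required.
        provider = "azure_openai" if data_residency else "openai"
        return (provider, "gpt-4.1")
    # Default, and the residency-safe fallback for latency-sensitive agents.
    return ("azure_openai", "gpt-4o")
```

Keeping the rule in one function makes the trade-offs explicit and easy to revise when the model list changes.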
Stacks, fallbacks, and output format
A few structural choices apply across the stack:
- Stack type. An agent uses either an stt_llm_tts pipeline (separate speech recognition, LLM, and speech synthesis) or a realtime stack. Most agents use the pipeline stack.
- Fallbacks. Each of STT, LLM, and TTS accepts a model fallback list plus an on_failure policy (attempt_timeout, max_retry, retry_interval). If the primary model errors or times out, Anyreach fails over to the next model in the list.
- Audio output format. TTS output format is set per channel; for example, ElevenLabs emits pcm_8000 for telephony. You normally don’t touch this; the platform picks a sane format per channel.
On the STT side, Deepgram supports keywords/keyterms boosting (disabled in multi-language mode), and Gladia supports code-switching and a processing region (us-west/eu-west).
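The fallback behaviour described above can be sketched as a retry loop over an ordered model list. This is an illustration under stated assumptions, not Anyreach's implementation: the `OnFailure` class, `call_with_fallback`, and the numeric defaults are hypothetical, `max_retry` is assumed to mean extra attempts per model before failing over, and the `call` function is assumed to enforce `attempt_timeout` itself by raising an exception.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class OnFailure:
    # Field names mirror the documented policy; the values are illustrative.
    attempt_timeout: float = 5.0   # seconds allowed per attempt
    max_retry: int = 1             # assumed: extra attempts per model
    retry_interval: float = 0.5    # seconds to wait between attempts


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str, float], Any],
    policy: OnFailure,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Try each model in order, retrying per policy, until one succeeds."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(policy.max_retry + 1):
            try:
                # `call` is expected to raise on error or timeout.
                return call(model, policy.attempt_timeout)
            except Exception as exc:
                last_error = exc
                if attempt < policy.max_retry:
                    sleep(policy.retry_interval)
    raise RuntimeError("all models in the fallback list failed") from last_error
```

With `max_retry=1`, a primary model that keeps timing out is tried twice before the next model in the list is attempted. Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.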

