Voice and model configuration

An agent’s perceived quality is dominated by three choices: how it sounds (TTS), how it understands you (STT), and how it thinks (LLM).

TTS — text to speech

Anyreach supports three TTS providers.

Cartesia Sonic-3

The default and recommended choice for most use cases.

Field	Default	Range
Voice ID	(provider-specific)	100+ voices, 42 languages
Speed	`1.0`	`0.6 – 1.5`
Volume	`1.0`	`0.5 – 2.0`
Emotion	`neutral`	`neutral`, `happy`, `sad`, `angry`, `surprised`, `curious`

Strong points: low latency, large voice catalog, expressive emotion control.
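The ranges in the table above can be checked before a configuration is saved. The following is a minimal sketch in Python; the class name, field names, and `validate` method are hypothetical and only illustrate the documented limits, not Anyreach's actual schema.

```python
from dataclasses import dataclass

# Limits and emotion names are taken from the table above.
# Everything else (class and field names) is an illustrative assumption.
SPEED_RANGE = (0.6, 1.5)
VOLUME_RANGE = (0.5, 2.0)
EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised", "curious"}


@dataclass
class CartesiaVoiceSettings:
    voice_id: str
    speed: float = 1.0
    volume: float = 1.0
    emotion: str = "neutral"

    def validate(self) -> list:
        """Return a list of problems; an empty list means the settings are in range."""
        problems = []
        if not SPEED_RANGE[0] <= self.speed <= SPEED_RANGE[1]:
            problems.append(f"speed {self.speed} is outside {SPEED_RANGE}")
        if not VOLUME_RANGE[0] <= self.volume <= VOLUME_RANGE[1]:
            problems.append(f"volume {self.volume} is outside {VOLUME_RANGE}")
        if self.emotion not in EMOTIONS:
            problems.append(f"unknown emotion {self.emotion!r}")
        return problems


# Defaults pass; an out-of-range speed and an unlisted emotion are both reported.
print(CartesiaVoiceSettings(voice_id="example-voice").validate())
print(CartesiaVoiceSettings(voice_id="example-voice", speed=2.0, emotion="bored").validate())
```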

ElevenLabs

Best for premium-sounding voices and voice cloning. Models: `eleven_turbo_v2`, `eleven_turbo_v2_5`.

Setting	Default
Stability	`0.5`
Similarity boost	`0.75`
Speed	`1.0`

Trade-off: slightly higher latency than Cartesia in our measurements; usually worth it if you’ve cloned a brand voice.

LiveKit Inference

Pass-through to any TTS available through LiveKit’s inference plane. Specify the model string explicitly. Useful if you have a contract with a provider not natively supported.

STT — speech to text

Provider	Model	Notes
Deepgram	`nova-2-general`, `nova-3-general`	Default. Robust, low latency, English-strong
Gladia	`solaria-1`	Strong multilingual + code-switching
AnyReach Default	(auto)	Resolves to a provider at runtime based on language
LiveKit Inference	custom string	Bring-your-own

Pick Deepgram nova-3-general for English-only agents. Pick Gladia solaria-1 when callers may switch languages mid-utterance (common in customer support). Use AnyReach Default if you don’t want to think about it.
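The selection guidance above can be written as a small decision function. This is a sketch only: the function name, the provider identifier strings, and the `code_switching` flag are assumptions made for illustration, and sending every other case to the runtime default is one reading of the advice rather than documented behavior.

```python
def pick_stt(languages, code_switching=False):
    """Return a (provider, model) pair following the guidance above.

    languages: list of language codes the agent must understand, e.g. ["en"].
    code_switching: True if callers may change language mid-utterance.
    """
    if code_switching:
        # Mid-utterance language switching is Gladia's documented strength.
        return ("gladia", "solaria-1")
    if languages == ["en"]:
        # English-only agents use the newer Deepgram model.
        return ("deepgram", "nova-3-general")
    # Otherwise let the platform resolve a provider at runtime (assumed fallback).
    return ("anyreach_default", "auto")


print(pick_stt(["en"]))
print(pick_stt(["en", "es"], code_switching=True))
print(pick_stt(["fr"]))
```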

LLM — the model that drives the conversation

Provider	Models
OpenAI	`gpt-4o`, `gpt-4.1`
Azure OpenAI	`gpt-4o`, `gpt-4.1`
Google Gemini 2.5	`flash`, `flash-lite`, `pro` (with `thinking_budget`, `top_p`/`top_k`)
Cerebras	`llama-3.3-70b`, `zai-glm-4.7`
Vercel AI Gateway	any model slug, e.g. `anthropic/claude-sonnet-4`
LiveKit Inference	custom model string

How to choose

Default: gpt-4o (Azure) — best quality-per-latency trade-off for most use cases
Lowest latency: Cerebras llama-3.3-70b — sub-second time to first token, good for high call volumes where latency dominates UX
Highest quality reasoning: gpt-4.1 — pick when the agent needs to follow complex multi-step instructions or reason about KB content carefully
Compliance: prefer Azure OpenAI variants if you have data-residency requirements

Stacks, fallbacks, and output format

Three structural choices apply across the stack:

Stack type. An agent uses either an `stt_llm_tts` pipeline (separate speech recognition, LLM, and speech synthesis) or a `realtime` stack. Most agents use the pipeline stack.
Fallbacks. Each of STT, LLM, and TTS accepts a model fallback list plus an `on_failure` policy (`attempt_timeout`, `max_retry`, `retry_interval`). If the primary model errors or times out, Anyreach fails over to the next model in the list.
Audio output format. TTS output format is set per channel; for example, ElevenLabs emits `pcm_8000` for telephony. You normally don't touch this; the platform picks a sane format per channel.
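To make the fallback behavior concrete, here is a minimal sketch of failing over through a model list. It assumes that `max_retry` counts extra attempts on the same model before moving to the next one and that the per-attempt timeout is enforced by the callable; neither detail is confirmed by this page, and the function is illustrative rather than Anyreach's implementation.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str, float], str],
    attempt_timeout: float = 2.0,
    max_retry: int = 1,
    retry_interval: float = 0.0,
) -> Tuple[str, str]:
    """Try each model in order and return (model, result) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        # One initial attempt plus up to max_retry retries per model (assumed semantics).
        for attempt in range(max_retry + 1):
            try:
                return model, call(model, attempt_timeout)
            except Exception as exc:  # an error or a timeout raised by the callable
                last_error = exc
                if attempt < max_retry:
                    time.sleep(retry_interval)
    raise RuntimeError("all models in the fallback list failed") from last_error


def fake_llm(model: str, timeout: float) -> str:
    """Stand-in for a provider call: the primary always times out."""
    if model == "gpt-4o":
        raise TimeoutError(f"{model} exceeded {timeout}s")
    return f"reply from {model}"


print(call_with_fallback(["gpt-4o", "llama-3.3-70b"], fake_llm))
```

With the stand-in above, the primary is attempted twice (one try plus one retry) before the second model answers.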

Provider-specific extras worth knowing: Deepgram supports keywords/keyterms boosting (disabled in multi-language mode), and Gladia supports code-switching and a processing region (us-west/eu-west).

Latency budget

A natural-feeling conversation needs end-to-end response latency under ~800ms. Latency stacks roughly as:

caller stops talking
  + STT finalization      (~150ms)
  + LLM TTFT              (~300-500ms)
  + TTS first audio       (~150-300ms)
= agent starts speaking

If your agent feels sluggish, the LLM is usually the culprit, since time to first token is the largest single term in the budget above. Try Cerebras `llama-3.3-70b` before tuning anything else.
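Summing the stages shows how little headroom the budget leaves. The sketch below uses only the approximate figures from the breakdown above; the stage names and helper function are illustrative.

```python
# (low_ms, high_ms) per stage, copied from the approximate breakdown above.
STAGES = {
    "stt_finalization": (150, 150),
    "llm_ttft": (300, 500),
    "tts_first_audio": (150, 300),
}
TARGET_MS = 800


def total_latency(stages):
    """Return the (best_case_ms, worst_case_ms) sum across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_latency(STAGES)
print(f"best case: {low} ms, worst case: {high} ms, target: {TARGET_MS} ms")
# prints "best case: 600 ms, worst case: 950 ms, target: 800 ms"
```

The best case fits under the target with 200 ms to spare, but the worst case overshoots it by 150 ms, so the upper end of the LLM and TTS ranges cannot both occur in the same turn.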

Where to set this

Voice and language settings live in the Purpose & Personality section, and model overrides live in the Advanced Instructions section. Both are on the agent edit page.
