Skip to main content
An agent’s perceived quality is dominated by three choices: how it sounds (TTS), how it understands you (STT), and how it thinks (LLM).

TTS — text to speech

Anyreach supports three TTS providers.

Cartesia Sonic-3

The default and recommended choice for most use cases.
FieldDefaultRange
Voice ID(provider-specific)100+ voices, 42 languages
Speed1.00.6 – 1.5
Volume1.00.5 – 2.0
Emotionneutralneutral, happy, sad, angry, surprised, curious
Strong points: low latency, large voice catalog, expressive emotion control.

ElevenLabs

Best for premium-sounding voices and voice cloning. Models: eleven_turbo_v2, eleven_turbo_v2_5
SettingDefault
Stability0.5
Similarity boost0.75
Speed1.0
Trade-off: slightly higher latency than Cartesia in our measurements; usually worth it if you’ve cloned a brand voice.

LiveKit Inference

Pass-through to any TTS available through LiveKit’s inference plane. Specify the model string explicitly. Useful if you have a contract with a provider not natively supported.

STT — speech to text

ProviderModelNotes
Deepgramnova-2-general, nova-3-generalDefault. Robust, low latency, English-strong
Gladiasolaria-1Strong multilingual + code-switching
AnyReach Default(auto)Resolves to a provider at runtime based on language
LiveKit Inferencecustom stringBring-your-own
Pick Deepgram nova-3-general for English-only agents. Pick Gladia solaria-1 when callers may switch languages mid-utterance (common in customer support). Use AnyReach Default if you don’t want to think about it.

LLM — the model that drives the conversation

ProviderModels
OpenAIgpt-4o, gpt-4.1
Azure OpenAIgpt-4o, gpt-4.1
Google Gemini 2.5flash, flash-lite, pro (with thinking_budget, top_p/top_k)
Cerebrasllama-3.3-70b, zai-glm-4.7
Vercel AI Gatewayany model slug, e.g. anthropic/claude-sonnet-4
LiveKit Inferencecustom model string

How to choose

  • Default: gpt-4o (Azure) — best quality-per-latency trade-off for most use cases
  • Lowest latency: Cerebras llama-3.3-70b — sub-second time to first token, good for high call volumes where latency dominates UX
  • Highest quality reasoning: gpt-4.1 — pick when the agent needs to follow complex multi-step instructions or reason about KB content carefully
  • Compliance: prefer Azure OpenAI variants if you have data-residency requirements

Stacks, fallbacks, and output format

A couple of structural choices apply across the stack:
  • Stack type. An agent uses either an stt_llm_tts pipeline (separate speech recognition, LLM, and speech synthesis) or a realtime stack. Most agents use the pipeline stack.
  • Fallbacks. Each of STT, LLM, and TTS accepts a model fallback list plus an on_failure policy (attempt_timeout, max_retry, retry_interval). If the primary model errors or times out, Anyreach fails over to the next model in the list.
  • Audio output format. TTS output format is set per channel — for example ElevenLabs emits pcm_8000 for telephony. You normally don’t touch this; the platform picks a sane format per channel.
Provider-specific extras worth knowing: Deepgram supports keywords/keyterms boosting (disabled in multi-language mode), and Gladia supports code-switching and a processing region (us-west/eu-west).

Latency budget

A natural-feeling conversation needs end-to-end response latency under ~800ms. Latency stacks roughly as:
caller stops talking
  + STT finalization      (~150ms)
  + LLM TTFT              (~300-500ms)
  + TTS first audio       (~150-300ms)
= agent starts speaking
If your agent feels sluggish, the LLM is usually the culprit. Try Cerebras Llama before tuning anything else.

Where to set this

All of these live in the Purpose & Personality section (voice + language) and Advanced Instructions section (model overrides) on the agent edit page.