# Voice and model configuration

> Choose the right voice, LLM, and speech recognizer for your agent.

An agent's perceived quality is dominated by three choices: how it sounds (TTS), how it understands the caller (STT), and how it thinks (LLM).

## TTS — text to speech

Anyreach supports three TTS providers.

### Cartesia Sonic-3

The default and recommended choice for most use cases.

| Field    | Default             | Range                                                      |
| -------- | ------------------- | ---------------------------------------------------------- |
| Voice ID | (provider-specific) | 100+ voices, 42 languages                                  |
| Speed    | `1.0`               | `0.6 – 1.5`                                                |
| Volume   | `1.0`               | `0.5 – 2.0`                                                |
| Emotion  | `neutral`           | `neutral`, `happy`, `sad`, `angry`, `surprised`, `curious` |

Strong points: low latency, large voice catalog, expressive emotion control.
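
As a rough illustration, the ranges in the table above can be checked before a voice configuration is saved. This is a minimal sketch: the class name, field names, and the assumption that both range endpoints are inclusive are hypothetical and do not reflect Anyreach's actual schema.

```python
from dataclasses import dataclass

# Ranges and defaults copied from the table above. The names below are
# hypothetical, and treating the endpoints as inclusive is an assumption.
EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised", "curious"}
SPEED_RANGE = (0.6, 1.5)
VOLUME_RANGE = (0.5, 2.0)


@dataclass(frozen=True)
class CartesiaVoiceSettings:
    voice_id: str
    speed: float = 1.0
    volume: float = 1.0
    emotion: str = "neutral"

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty if valid)."""
        problems = []
        if not SPEED_RANGE[0] <= self.speed <= SPEED_RANGE[1]:
            problems.append(f"speed {self.speed} outside {SPEED_RANGE}")
        if not VOLUME_RANGE[0] <= self.volume <= VOLUME_RANGE[1]:
            problems.append(f"volume {self.volume} outside {VOLUME_RANGE}")
        if self.emotion not in EMOTIONS:
            problems.append(f"unknown emotion {self.emotion!r}")
        return problems
```

With this sketch, `CartesiaVoiceSettings(voice_id="demo").validate()` returns an empty list, while a speed of `2.0` or an emotion outside the listed set each adds one entry to the result.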

### ElevenLabs

Best for premium-sounding voices and voice cloning.

Models: `eleven_turbo_v2`, `eleven_turbo_v2_5`

| Setting          | Default |
| ---------------- | ------- |
| Stability        | `0.5`   |
| Similarity boost | `0.75`  |
| Speed            | `1.0`   |

Trade-off: slightly higher latency than Cartesia in our measurements; usually worth it if you've cloned a brand voice.
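
One way to think about the defaults in the table above is as a base that per-agent overrides are merged onto. The sketch below assumes hypothetical key names (`stability`, `similarity_boost`, `speed`), which may not match the field names Anyreach uses.

```python
# Defaults copied from the table above. The key names are hypothetical.
ELEVENLABS_DEFAULTS = {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "speed": 1.0,
}


def elevenlabs_settings(**overrides: float) -> dict[str, float]:
    """Merge overrides onto the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(ELEVENLABS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return {**ELEVENLABS_DEFAULTS, **overrides}
```

Calling `elevenlabs_settings(stability=0.8)` changes only the stability value and leaves the other two settings at their defaults.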

### LiveKit Inference

Pass-through to any TTS available through LiveKit's inference plane. Specify the model string explicitly. Useful if you have a contract with a provider not natively supported.

## STT — speech to text

| Provider              | Model                              | Notes                                               |
| --------------------- | ---------------------------------- | --------------------------------------------------- |
| **Deepgram**          | `nova-2-general`, `nova-3-general` | Default. Robust, low latency, English-strong        |
| **Gladia**            | `solaria-1`                        | Strong multilingual + code-switching                |
| **AnyReach Default**  | (auto)                             | Resolves to a provider at runtime based on language |
| **LiveKit Inference** | custom string                      | Bring-your-own                                      |

Pick **Deepgram nova-3-general** for English-only agents. Pick **Gladia solaria-1** when callers may switch languages mid-utterance (common in customer support). Use **AnyReach Default** if you don't want to think about it.
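
The guidance above can be sketched as a small decision function. The provider identifiers, the function signature, and the choice to fall back to AnyReach Default for a single non-English language are assumptions made for illustration, not part of the platform's API.

```python
def pick_stt(language: str, code_switching: bool = False) -> tuple[str, str]:
    """Return a (provider, model) pair following the guidance above.

    `language` is a language code such as "en"; `code_switching` marks
    agents whose callers may switch languages mid-utterance.
    """
    if code_switching:
        return ("gladia", "solaria-1")
    if language == "en":
        return ("deepgram", "nova-3-general")
    # Assumption: anything else is left to runtime resolution.
    return ("anyreach_default", "auto")
```

For example, `pick_stt("en")` selects Deepgram, while `pick_stt("en", code_switching=True)` selects Gladia because the code-switching check runs first.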

## LLM — the model that drives the conversation

| Provider          | Models                                                                 |
| ----------------- | ---------------------------------------------------------------------- |
| OpenAI            | `gpt-4o`, `gpt-4.1`                                                    |
| Azure OpenAI      | `gpt-4o`, `gpt-4.1`                                                    |
| Google Gemini 2.5 | `flash`, `flash-lite`, `pro` (with `thinking_budget`, `top_p`/`top_k`) |
| Cerebras          | `llama-3.3-70b`, `zai-glm-4.7`                                         |
| Vercel AI Gateway | any model slug, e.g. `anthropic/claude-sonnet-4`                       |
| LiveKit Inference | custom model string                                                    |

### How to choose

* **Default**: `gpt-4o` (Azure) — best quality-per-latency trade-off for most use cases
* **Lowest latency**: Cerebras `llama-3.3-70b` — sub-second time to first token, good for high call volumes where latency dominates UX
* **Highest quality reasoning**: `gpt-4.1` — pick when the agent needs to follow complex multi-step instructions or reason about KB content carefully
* **Compliance**: prefer Azure OpenAI variants if you have data-residency requirements
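
The bullets above can be read as a simple precedence order. The sketch below is one possible reading, not platform behavior: it assumes data residency overrides the latency preference, that `gpt-4.1` is served through OpenAI when residency is not a concern, and it uses hypothetical provider identifiers.

```python
def pick_llm(priority: str = "balanced", data_residency: bool = False) -> tuple[str, str]:
    """Return a (provider, model) pair.

    `priority` is one of "balanced", "latency", or "reasoning".
    """
    if priority not in {"balanced", "latency", "reasoning"}:
        raise ValueError(f"unknown priority: {priority!r}")
    if data_residency:
        # Assumption: compliance wins over latency, so stay on Azure OpenAI.
        model = "gpt-4.1" if priority == "reasoning" else "gpt-4o"
        return ("azure_openai", model)
    if priority == "latency":
        return ("cerebras", "llama-3.3-70b")
    if priority == "reasoning":
        return ("openai", "gpt-4.1")
    return ("azure_openai", "gpt-4o")
```

Under these assumptions, `pick_llm()` yields the documented default of `gpt-4o` on Azure, and `pick_llm("latency", data_residency=True)` stays on Azure OpenAI rather than switching to Cerebras.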

## Stacks, fallbacks, and output format

Three structural choices apply across the stack:

* **Stack type.** An agent uses either an `stt_llm_tts` pipeline (separate speech recognition, LLM, and speech synthesis) or a `realtime` stack. Most agents use the pipeline stack.
* **Fallbacks.** Each of STT, LLM, and TTS accepts a model **fallback list** plus an `on_failure` policy (`attempt_timeout`, `max_retry`, `retry_interval`). If the primary model errors or times out, Anyreach fails over to the next model in the list.
* **Audio output format.** TTS output format is set per channel; for example, ElevenLabs emits `pcm_8000` for telephony. You normally don't need to change this, because the platform picks a suitable format for each channel.

Provider-specific extras worth knowing: Deepgram supports `keywords`/`keyterms` boosting (disabled in multi-language mode), and Gladia supports code-switching and a processing `region` (`us-west`/`eu-west`).
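
To make the fallback behavior concrete, here is a minimal sketch of how a fallback list and an `on_failure` policy could interact. The three field names come from the list above, but their exact semantics are assumptions: this sketch treats `max_retry` as the number of retries per model after the first attempt, measures both time values in seconds, and moves to the next model once retries are exhausted. `call_with_fallback` and `OnFailure` are hypothetical names, not Anyreach APIs.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class OnFailure:
    attempt_timeout: float  # seconds allowed per attempt (assumed unit)
    max_retry: int          # retries per model after the first attempt (assumed)
    retry_interval: float   # seconds to wait between retries (assumed unit)


async def call_with_fallback(
    models: list[str],
    call: Callable[[str], Awaitable[str]],
    policy: OnFailure,
) -> tuple[str, str]:
    """Try each model in order; return (model, result) for the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(policy.max_retry + 1):
            try:
                result = await asyncio.wait_for(call(model), policy.attempt_timeout)
                return model, result
            except Exception as exc:  # errors and per-attempt timeouts
                last_error = exc
                if attempt < policy.max_retry:
                    await asyncio.sleep(policy.retry_interval)
    raise RuntimeError("all models in the fallback list failed") from last_error
```

With `max_retry=1`, a primary model that always errors is attempted twice before the next model in the list is tried once. If every model fails, the sketch raises a `RuntimeError` chained to the last underlying error.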

## Latency budget

A natural-feeling conversation needs end-to-end response latency under \~800ms. Latency stacks roughly as:

```
caller stops talking
  + STT finalization      (~150ms)
  + LLM TTFT              (~300-500ms)
  + TTS first audio       (~150-300ms)
= agent starts speaking
```

If your agent feels sluggish, the LLM is usually the culprit: its time to first token is the largest and most variable term in the breakdown above. Try Cerebras `llama-3.3-70b` before tuning anything else.
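
Summing the per-stage estimates shows how tight the budget is. The sketch below uses only the approximate figures from the breakdown above; the dictionary keys are illustrative names.

```python
# Per-stage (low, high) estimates in milliseconds, copied from the
# breakdown above. Key names are illustrative.
STAGES_MS = {
    "stt_finalization": (150, 150),
    "llm_ttft": (300, 500),
    "tts_first_audio": (150, 300),
}
BUDGET_MS = 800

best = sum(low for low, _ in STAGES_MS.values())
worst = sum(high for _, high in STAGES_MS.values())
headroom = BUDGET_MS - worst

print(f"best case:  {best} ms")
print(f"worst case: {worst} ms")
print(f"headroom at worst case: {headroom} ms")
```

The totals come to 600 ms in the best case and 950 ms in the worst case, so the 800 ms target is only met when the LLM and TTS stages land toward the low end of their ranges.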

## Where to set this

All of these settings live on the agent edit page: voice and language in the **Purpose & Personality** section, and model overrides in the **Advanced Instructions** section.
