# Chunking and embeddings

> How documents become a queryable vector index.

This page explains what happens in the `chunking → embedding` stage of the ingestion pipeline. Understanding this stage helps you debug retrieval quality issues.

## Chunking strategies

Anyreach supports two strategies, picked per source:

### Fixed (default)

Splits the markdown text into consecutive chunks of `chunk_size` characters, cutting at character offsets regardless of sentence or markup boundaries.

| Parameter    | Default           |
| ------------ | ----------------- |
| `chunk_size` | `1000` characters |
| Overlap      | none              |

Pros: predictable, simple, fast.
Cons: a chunk can split a sentence or list mid-way.
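
As a rough illustration of fixed chunking, the sketch below slices a string into consecutive `chunk_size`-character pieces with no overlap. The function name `chunk_fixed` is hypothetical and this is not Anyreach's actual implementation; it only shows why a cut can land mid-sentence.

```python
def chunk_fixed(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunk_size-character slices with no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# 42 characters repeated 60 times gives 2,520 characters of input.
sample = "Press 1 for billing. Press 2 for support. " * 60
chunks = chunk_fixed(sample)
print([len(c) for c in chunks])  # [1000, 1000, 520]
print(repr(chunks[0][-20:]))     # the first chunk ends wherever character 1000 falls
```

Because the cut positions depend only on character offsets, the first chunk here ends partway through a sentence, which is the trade-off listed under Cons.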

### Structure-based

Respects the document's markdown structure — paragraphs, lists, code blocks, tables — and tries to keep semantic units intact while staying under the chunk size cap.

| Parameter    | Default                           |
| ------------ | --------------------------------- |
| `chunk_size` | `1000` characters (max per chunk) |

Pros: chunks are more coherent; retrieval quality on technical/structured docs is meaningfully better.
Cons: chunks vary in size; very long paragraphs may still get split.
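
One way to picture structure-based chunking is as greedy packing of whole markdown blocks under the size cap. The sketch below is an assumption about the general technique, not Anyreach's implementation: it treats blank-line-separated blocks (with fenced code kept whole) as the semantic units, and the names `split_blocks` and `chunk_structured` are hypothetical.

```python
def split_blocks(markdown: str) -> list[str]:
    """Split markdown on blank lines, keeping fenced code blocks in one piece."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_structured(markdown: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack whole blocks into chunks no longer than chunk_size."""
    chunks, current = [], ""
    for block in split_blocks(markdown):
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A block longer than the cap still has to be split into fixed slices.
        while len(block) > chunk_size:
            chunks.append(block[:chunk_size])
            block = block[chunk_size:]
        current = block
    if current:
        chunks.append(current)
    return chunks


doc = (
    "## Hours\n\nWe are open 9 to 5.\n\n- Monday\n- Tuesday\n\n"
    "## Billing\n\nInvoices go out monthly."
)
for chunk in chunk_structured(doc, chunk_size=50):
    print(repr(chunk))
```

With a 50-character cap, the list stays together with the text that introduces it and the billing section starts a new chunk. The fallback loop corresponds to the caveat that very long paragraphs may still get split.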

### Picking a strategy

| Content type                   | Use             |
| ------------------------------ | --------------- |
| FAQs, support articles         | Structure-based |
| Long-form prose, narratives    | Fixed           |
| Technical specs, API docs      | Structure-based |
| Scanned PDFs (post-extraction) | Fixed           |

You can change the chunking strategy per source. Changing it triggers re-chunking and re-embedding for that source.

## Embedding models

Two OpenAI models are available; pick once per KB at creation time.

| Model                    | Dimensions | Cost (relative) | When to use                                                         |
| ------------------------ | ---------- | --------------- | ------------------------------------------------------------------- |
| `text-embedding-3-small` | 768        | 1×              | Default; works well for most content                                |
| `text-embedding-3-large` | 3072       | 6.5×            | Long-tail factual recall, dense technical content, multilingual KBs |

`text-embedding-3-small` can also be configured at **1536 dimensions** (the model's native output size) if you want higher-dimensional vectors without moving to the large model.

Switching models on an existing KB requires re-embedding every source. Plan accordingly — for a 1,000-source KB this can take several minutes.

## How retrieval scoring works

At query time:

1. The query text is embedded with the same model used by the KB.
2. Cosine similarity is computed against every chunk's embedding.
3. The top `top_n` chunks (by similarity) are returned.

There's no hard similarity threshold by default — the top `top_n` are returned regardless of absolute score. If you need a threshold, filter on the client side after a `POST /datasets/{id}/query`.
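
The scoring steps above, plus a client-side threshold, can be sketched in plain Python. Everything here is illustrative: the three-dimensional vectors stand in for real embeddings, the function names are hypothetical, and the `(chunk_id, score)` pairs are an assumed shape, since the response format of `POST /datasets/{id}/query` is not described on this page.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec, chunk_vecs, top_n=3):
    """Score every chunk against the query and keep the top_n by similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def apply_threshold(results, min_score):
    """Client-side filter: drop results whose score is below min_score."""
    return [(cid, score) for cid, score in results if score >= min_score]


# Toy stand-ins for chunk embeddings (assumption: real vectors have 768+ dimensions).
chunk_vecs = {
    "hours": [1.0, 0.0, 0.0],
    "billing": [0.0, 1.0, 0.0],
    "returns": [0.6, 0.8, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

results = top_chunks(query_vec, chunk_vecs, top_n=2)
print(results)                        # "hours" first, then "returns"
print(apply_threshold(results, 0.7))  # only "hours" clears the threshold
```

Note that `top_chunks` returns two results even though the second one scores only about 0.6, which mirrors the default behavior: ranking alone decides what comes back unless you filter afterwards.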

## Debugging retrieval

If callers ask a question whose answer is in your KB but the agent doesn't find it:

1. **Open the KB query tester.** From the KB page, click **Test query** and paste the caller's exact phrasing.
2. **Look at the returned chunks.** If the right chunk is in the list but ranked low, increase `top_n`. If it's not in the list at all, the chunk text doesn't carry enough signal for that phrasing; see Common retrieval failures below.
3. **Check the source.** Click into the source and view its chunks. Is the answer split across two chunks? Switch that source to structure-based chunking. Is the answer buried in nav/boilerplate text? Strip the boilerplate before re-uploading.
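
If you have copied a source's chunks out in order, a small helper can confirm whether a known answer straddles a chunk boundary. This is a sketch under two assumptions: the chunks concatenate back to the original text (true for fixed chunking with no overlap), and `locate_answer` is a hypothetical name, not part of any Anyreach tooling.

```python
def locate_answer(chunks: list[str], answer: str) -> list[int]:
    """Return the indexes of the chunks that the answer text overlaps."""
    text = "".join(chunks)
    start = text.find(answer)
    if start == -1:
        return []
    end = start + len(answer)
    hits, offset = [], 0
    for index, chunk in enumerate(chunks):
        chunk_end = offset + len(chunk)
        # The chunk overlaps the answer if their character ranges intersect.
        if offset < end and start < chunk_end:
            hits.append(index)
        offset = chunk_end
    return hits


chunks = ["Our support line is open from 9 am ", "to 5 pm on weekdays."]
print(locate_answer(chunks, "9 am to 5 pm"))  # [0, 1]: split across two chunks
print(locate_answer(chunks, "weekdays"))      # [1]: contained in one chunk
```

A result with more than one index means neither chunk holds the full answer on its own, which is the case where switching that source to structure-based chunking is worth trying.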

## Common retrieval failures

| Symptom                       | Likely cause                               | Fix                                                                     |
| ----------------------------- | ------------------------------------------ | ----------------------------------------------------------------------- |
| Right info, not retrieved     | Question phrased very differently from doc | Add a synonym-rich preamble to the chunk, or include FAQ-style rewrites |
| Boilerplate dominates results | Web pages with heavy nav                   | Strip nav before ingestion                                              |
| Answer split across chunks    | Fixed chunking on a structured doc         | Switch to structure-based                                               |
| Multilingual misses           | KB uses `text-embedding-3-small`           | Switch to `text-embedding-3-large`                                      |
