This page covers the chunking → embedding part of the ingestion pipeline. Understanding it will help you debug retrieval quality issues.
Chunking strategies
Anyreach supports two strategies, picked per source:
Fixed (default)
Splits the markdown text into chunks of a fixed character size at character boundaries.
| Parameter | Default |
|---|---|
| chunk_size | 1000 characters |
| Overlap | none |
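
As a rough illustration of what fixed chunking with no overlap does, the sketch below slices a string every chunk_size characters. It is a minimal sketch and not Anyreach's actual implementation; the function name fixed_chunks is hypothetical.

```python
def fixed_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper: cuts at raw character offsets with no overlap,
    so a sentence can be split in the middle.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 2500
chunks = fixed_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Because the cut points ignore sentence and paragraph boundaries, an answer that straddles a multiple of chunk_size ends up in two different chunks.
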
Structure-based
Respects the document’s markdown structure (paragraphs, lists, code blocks, tables) and tries to keep semantic units intact while staying under the chunk size cap.
| Parameter | Default |
|---|---|
| chunk_size | 1000 characters (max per chunk) |
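
One plausible way to approximate this behavior is to split the markdown into blocks on blank lines, keep fenced code blocks whole, and greedily pack blocks into chunks under the cap. The sketch below assumes that approach; it is not Anyreach's actual algorithm, and the names split_blocks and structure_chunks are hypothetical.

```python
FENCE = "`" * 3  # markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split markdown into blocks on blank lines, keeping fenced code whole."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def structure_chunks(markdown: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack whole blocks into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for block in split_blocks(markdown):
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single block larger than the cap falls back to a hard split.
        while len(block) > chunk_size:
            chunks.append(block[:chunk_size])
            block = block[chunk_size:]
        current = block
    if current:
        chunks.append(current)
    return chunks


doc = "# Title\n\nPara one.\n\n- a\n- b\n\nPara two."
for chunk in structure_chunks(doc, chunk_size=25):
    print(repr(chunk))
```

With the tiny cap used here, the list stays in one chunk instead of being cut between its items, which is the property that matters for FAQs and API docs.
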
Picking a strategy
| Content type | Use |
|---|---|
| FAQs, support articles | Structure-based |
| Long-form prose, narratives | Fixed |
| Technical specs, API docs | Structure-based |
| Scanned PDFs (post-extraction) | Fixed |
Embedding models
Two OpenAI models are available; pick once per KB at creation time.
| Model | Dimensions | Cost (relative) | When to use |
|---|---|---|---|
| text-embedding-3-small | 768 | 1× | Default; works well for most content |
| text-embedding-3-large | 3072 | 6.5× | Long-tail factual recall, dense technical content, multilingual KBs |
An alternative text-embedding-3-small configuration is available if you want denser vectors without going to the large model.
Switching models on an existing KB requires re-embedding every source. Plan accordingly — for a 1,000-source KB this can take several minutes.
How retrieval scoring works
At query time:
- The query text is embedded with the same model used by the KB.
- Cosine similarity is computed against every chunk’s embedding.
- The top top_n chunks (by similarity) are returned.
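
The steps above can be sketched in plain Python. This is an illustration of cosine-similarity ranking on toy 2-dimensional vectors, not Anyreach's implementation; the function names and chunk ids are hypothetical, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_chunks(query_vec, chunk_vecs, top_n=3):
    """Score every chunk against the query and return the top_n (id, score) pairs."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embeddings, made up for illustration only.
chunk_vecs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
ranked = top_n_chunks([1.0, 0.0], chunk_vecs, top_n=2)
print([cid for cid, _ in ranked])  # ['a', 'b']
```

Note that the ranking is purely relative: with top_n=4 the chunk "d" would still be returned even though its similarity is negative.
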
No score threshold is applied: the top top_n chunks are returned regardless of absolute score. If you need a threshold, filter on the client side after a POST /datasets/{id}/query.
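
A client-side threshold can be as simple as the sketch below. The response shape is an assumption: it treats the query result as a list of objects with hypothetical text and score fields, so check the actual response of POST /datasets/{id}/query and adjust the field names. The sample results and the 0.5 cutoff are made up for illustration.

```python
def filter_by_score(results: list[dict], min_score: float) -> list[dict]:
    """Keep only results whose similarity score is at least min_score.

    Assumes each result is a dict with a numeric "score" key
    (hypothetical field name; adapt to the real response).
    """
    return [r for r in results if r["score"] >= min_score]


# Illustrative data standing in for a parsed query response.
results = [
    {"text": "Refunds are processed within 5 business days.", "score": 0.82},
    {"text": "Our office is closed on public holidays.", "score": 0.31},
]
kept = filter_by_score(results, min_score=0.5)
print(len(kept))  # 1
```

A reasonable cutoff depends on the embedding model and the content, so it is worth calibrating against queries whose correct chunk you already know.
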
Debugging retrieval
If callers ask a question whose answer is in your KB but the agent doesn’t find it:
- Open the KB query tester. From the KB page, click Test query and paste the caller’s exact phrasing.
- Look at the returned chunks. If the right chunk is in the list but ranked low, increase top_n. If it’s not in the list at all, the chunk text doesn’t carry enough signal (see Common retrieval failures below).
- Check the source. Click into the source and view its chunks. Is the answer split across two chunks? Switch that source to structure-based chunking. Is the answer buried in nav/boilerplate text? Strip the boilerplate before re-uploading.
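
To check offline whether an answer would be split by fixed chunking, you can compute which chunk indices a known answer string falls into. This is a minimal sketch that assumes chunks are cut at exact multiples of chunk_size in the extracted text; the function name chunks_spanned is hypothetical, and the chunk view in the KB remains the source of truth.

```python
def chunks_spanned(text: str, snippet: str, chunk_size: int = 1000) -> list[int]:
    """Return the 0-based indices of the fixed-size chunks that contain snippet.

    An empty list means the snippet was not found. More than one index
    means the snippet straddles a chunk boundary.
    """
    if not snippet:
        return []
    start = text.find(snippet)
    if start == -1:
        return []
    end = start + len(snippet) - 1  # offset of the snippet's last character
    return list(range(start // chunk_size, end // chunk_size + 1))


# The answer starts at offset 990, so it crosses the boundary at 1000.
source_text = "a" * 990 + "Refunds take 5 days." + "b" * 500
print(chunks_spanned(source_text, "Refunds take 5 days."))  # [0, 1]
```

If the result has more than one index, that source is a good candidate for structure-based chunking.
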
Common retrieval failures
| Symptom | Likely cause | Fix |
|---|---|---|
| Right info, not retrieved | Question phrased very differently from doc | Add a synonym-rich preamble to the chunk, or include FAQ-style rewrites |
| Boilerplate dominates results | Web pages with heavy nav | Strip nav before ingestion |
| Answer split across chunks | Fixed chunking on a structured doc | Switch to structure-based |
| Multilingual misses | KB uses text-embedding-3-small | Switch to text-embedding-3-large |
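
For the boilerplate case, one way to strip navigation before ingestion is to drop the text inside layout elements while extracting the page text. The sketch below uses Python's standard html.parser module; the choice of tags to skip and the names BoilerplateStripper and strip_boilerplate are assumptions for illustration, and pages with unbalanced tags may need a more robust HTML library.

```python
from html.parser import HTMLParser


class BoilerplateStripper(HTMLParser):
    """Collect page text, ignoring anything nested inside layout elements."""

    # Assumed set of boilerplate containers; adjust for your site.
    SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><nav><a href='/'>Home</a> <a href='/pricing'>Pricing</a></nav>"
    "<main><h1>Refunds</h1><p>Refunds take 5 days.</p></main>"
    "<footer>Contact us</footer></body></html>"
)
print(strip_boilerplate(page))
```

Only the heading and the paragraph from the main element survive, so the resulting chunks carry the answer text rather than menu labels.
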

