Chunking and embeddings

This page is about what happens inside the chunking → embedding part of the ingestion pipeline. Understanding it will help you debug retrieval quality issues.

Chunking strategies

Anyreach supports two strategies, picked per source:

Fixed (default)

Splits the markdown text into chunks of a fixed character size at character boundaries.

Parameter	Default
`chunk_size`	`1000` characters
Overlap	none

Pros: predictable, simple, fast. Cons: a chunk can split a sentence or list mid-way.

Structure-based

Respects the document’s markdown structure — paragraphs, lists, code blocks, tables — and tries to keep semantic units intact while staying under the chunk size cap.

Parameter	Default
`chunk_size`	`1000` characters (max per chunk)

Pros: chunks are more coherent; retrieval quality on technical/structured docs is meaningfully better. Cons: chunks vary in size; very long paragraphs may still get split.

Picking a strategy

Content type	Use
FAQs, support articles	Structure-based
Long-form prose, narratives	Fixed
Technical specs, API docs	Structure-based
Scanned PDFs (post-extraction)	Fixed

You can change the chunking strategy per source. Changing it triggers re-chunking and re-embedding for that source.

Embedding models

Two OpenAI models are available; pick once per KB at creation time.

Model	Dimensions	Cost (relative)	When to use
`text-embedding-3-small`	768	1×	Default; works well for most content
`text-embedding-3-large`	3072	6.5×	Long-tail factual recall, dense technical content, multilingual KBs

You can also pick 1536-dim as a text-embedding-3-small configuration if you want denser vectors without going to the large model. Switching models on an existing KB requires re-embedding every source. Plan accordingly — for a 1,000-source KB this can take several minutes.

How retrieval scoring works

At query time:

The query text is embedded with the same model used by the KB.
Cosine similarity is computed against every chunk’s embedding.
The top top_n chunks (by similarity) are returned.

There’s no hard similarity threshold by default — the top top_n are returned regardless of absolute score. If you need a threshold, filter on the client side after a POST /datasets/{id}/query.

Debugging retrieval

If callers ask a question whose answer is in your KB but the agent doesn’t find it:

Open the KB query tester. From the KB page, click Test query and paste the caller’s exact phrasing.
Look at the returned chunks. If the right chunk is in the list but ranked low, increase top_n. If it’s not in the list at all, the chunk text doesn’t carry enough signal — see below.
Check the source. Click into the source and view its chunks. Is the answer split across two chunks? Switch that source to structure-based chunking. Is the answer buried in nav/boilerplate text? Strip the boilerplate before re-uploading.

Common retrieval failures

Symptom	Likely cause	Fix
Right info, not retrieved	Question phrased very differently from doc	Add a synonym-rich preamble to the chunk, or include FAQ-style rewrites
Boilerplate dominates results	Web pages with heavy nav	Strip nav before ingestion
Answer split across chunks	Fixed chunking on a structured doc	Switch to structure-based
Multilingual misses	`small` embedding	Switch to `text-embedding-3-large`

​Chunking strategies

​Fixed (default)

​Structure-based

​Picking a strategy

​Embedding models

​How retrieval scoring works

​Debugging retrieval

​Common retrieval failures