This page covers what happens inside the chunking → embedding stage of the ingestion pipeline. Understanding it will help you debug retrieval-quality issues.

Chunking strategies

Anyreach supports two strategies, picked per source:

Fixed (default)

Splits the markdown text into chunks of a fixed character size at character boundaries.
| Parameter | Default |
| --- | --- |
| chunk_size | 1000 characters |
| Overlap | none |
Pros: predictable, simple, fast. Cons: a chunk can split a sentence or list mid-way.
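To make the mid-sentence split concrete, here is a minimal sketch of fixed-size chunking with no overlap. It illustrates the behavior described above and is not Anyreach's actual implementation; the function name `fixed_chunks` and the sample text are made up for this example.

```python
def fixed_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunk_size-character pieces with no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample: one 30-character sentence repeated 100 times (3,000 characters).
doc = "The refund window is 30 days. " * 100
chunks = fixed_chunks(doc)

print(len(chunks))            # 3
print(repr(chunks[0][-10:]))  # 'The refund'  (the sentence is cut at the boundary)
```

Because 1000 is not a multiple of the sentence length, the first chunk ends partway through a sentence and the next chunk begins with the remainder.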

Structure-based

Respects the document’s markdown structure — paragraphs, lists, code blocks, tables — and tries to keep semantic units intact while staying under the chunk size cap.
| Parameter | Default |
| --- | --- |
| chunk_size | 1000 characters (max per chunk) |
Pros: chunks are more coherent; retrieval quality on technical/structured docs is meaningfully better. Cons: chunks vary in size; very long paragraphs may still get split.
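The general idea can be sketched as greedy packing: treat each blank-line-separated block as a unit, add blocks to the current chunk while it stays under the cap, and hard-split only blocks that exceed the cap on their own. This is a simplified approximation, not Anyreach's implementation; it does not parse code fences or tables, and the name `structure_chunks` is hypothetical.

```python
import re


def structure_chunks(markdown: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks of at most chunk_size characters."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= chunk_size:
            current = candidate  # block fits, keep packing
            continue
        if current:
            chunks.append(current)  # flush what has been packed so far
        # A block longer than the cap cannot stay intact; fall back to hard splits.
        while len(block) > chunk_size:
            chunks.append(block[:chunk_size])
            block = block[chunk_size:]
        current = block
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample document with a heading, a paragraph, and a list.
sample = (
    "# Refunds\n\n"
    "Refunds are issued within 30 days.\n\n"
    "- Keep the receipt\n- Contact support"
)
for chunk in structure_chunks(sample, chunk_size=60):
    print(repr(chunk))
```

With a cap of 60 characters, the heading and paragraph share one chunk and the list stays whole in the next, whereas a fixed split at 60 characters would cut the list in two.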

Picking a strategy

| Content type | Use |
| --- | --- |
| FAQs, support articles | Structure-based |
| Long-form prose, narratives | Fixed |
| Technical specs, API docs | Structure-based |
| Scanned PDFs (post-extraction) | Fixed |
You can change the chunking strategy per source. Changing it triggers re-chunking and re-embedding for that source.

Embedding models

Two OpenAI models are available; pick once per KB at creation time.
| Model | Dimensions | Cost (relative) | When to use |
| --- | --- | --- | --- |
| text-embedding-3-small | 768 | | Default; works well for most content |
| text-embedding-3-large | 3072 | 6.5× | Long-tail factual recall, dense technical content, multilingual KBs |
You can also configure text-embedding-3-small at 1536 dimensions if you want higher-dimensional vectors without moving to the large model. Switching models on an existing KB requires re-embedding every source, so plan accordingly: for a 1,000-source KB this can take several minutes.

How retrieval scoring works

At query time:
  1. The query text is embedded with the same model used by the KB.
  2. Cosine similarity is computed against every chunk’s embedding.
  3. The top top_n chunks (by similarity) are returned.
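The scoring steps above can be sketched as follows. This is an illustrative brute-force version using toy 3-dimensional vectors in place of real embeddings; the function names and chunk IDs are hypothetical, and the production system may compute the same ranking differently (for example, with a vector index).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], top_n: int = 3
) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the top_n by similarity."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-ins for embeddings; real vectors have hundreds or thousands of dimensions.
chunk_vecs = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for chunk_id, score in top_n_chunks(query_vec, chunk_vecs, top_n=3):
    print(f"{chunk_id}: {score:.3f}")
```

Note that with `top_n=3` the barely related "careers" chunk is still returned, just with a very low score, because the ranking applies no cutoff.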
There’s no hard similarity threshold by default — the top top_n are returned regardless of absolute score. If you need a threshold, filter on the client side after a POST /datasets/{id}/query.
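A client-side threshold can be as small as the sketch below. The response shape is an assumption made for illustration: the `score` and `text` field names, the sample results, and the 0.5 cutoff are all hypothetical, so check the actual query response before relying on them.

```python
def filter_by_score(results: list[dict], min_score: float) -> list[dict]:
    """Keep only results whose similarity score meets the threshold."""
    # Assumes each result carries a numeric "score" field (hypothetical field name).
    return [r for r in results if r["score"] >= min_score]


# Hypothetical results, shaped as they might look after parsing the query response.
results = [
    {"text": "Refunds are issued within 30 days.", "score": 0.82},
    {"text": "We ship to 40 countries.", "score": 0.41},
    {"text": "Join our team.", "score": 0.12},
]

kept = filter_by_score(results, min_score=0.5)
print([r["text"] for r in kept])
```

A sensible cutoff depends on the embedding model and the content, so it is worth picking one by inspecting scores for known-good and known-bad queries rather than reusing a fixed number.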

Debugging retrieval

If callers ask a question whose answer is in your KB but the agent doesn’t find it:
  1. Open the KB query tester. From the KB page, click Test query and paste the caller’s exact phrasing.
  2. Look at the returned chunks. If the right chunk is in the list but ranked low, increase top_n. If it’s not in the list at all, the chunk text doesn’t carry enough signal — see below.
  3. Check the source. Click into the source and view its chunks. Is the answer split across two chunks? Switch that source to structure-based chunking. Is the answer buried in nav/boilerplate text? Strip the boilerplate before re-uploading.
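For the "is the answer split across two chunks?" check, a quick offline test is possible when the source uses fixed chunking. The sketch below assumes no overlap, assumes you have the same extracted text that was chunked, and assumes the answer appears in it verbatim; the function name and sample text are hypothetical.

```python
from typing import Optional


def straddles_boundary(source_text: str, answer: str, chunk_size: int = 1000) -> Optional[bool]:
    """Return True if the first verbatim occurrence of answer crosses a fixed chunk boundary.

    Returns None when the answer is empty or not found verbatim in source_text.
    """
    if not answer:
        return None
    start = source_text.find(answer)
    if start == -1:
        return None
    end = start + len(answer) - 1  # index of the answer's last character
    return start // chunk_size != end // chunk_size


# Hypothetical source: 990 filler characters, then the answer sentence.
source_text = "x" * 990 + "Refunds take 30 days."
print(straddles_boundary(source_text, "Refunds take 30 days."))  # True
```

A `True` result suggests the source is a candidate for structure-based chunking; `None` usually means the phrasing in the source differs from the text you searched for.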

Common retrieval failures

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Right info, not retrieved | Question phrased very differently from doc | Add a synonym-rich preamble to the chunk, or include FAQ-style rewrites |
| Boilerplate dominates results | Web pages with heavy nav | Strip nav before ingestion |
| Answer split across chunks | Fixed chunking on a structured doc | Switch to structure-based |
| Multilingual misses | Small embedding model | Switch to text-embedding-3-large |