Knowledge bases overview

A knowledge base (KB) is a vector index of your content — uploaded files or crawled URLs — that agents query in real time to answer factual questions.

When to use one

Use a KB when:

Callers ask factual questions whose answers live in static documents (FAQs, manuals, policy pages)
The content is too long to fit in the agent’s system prompt
The content changes infrequently: once a day or less often

Don’t use a KB for:

Per-call state (the caller’s name, prior answers in this call) — that lives in conversation context
Real-time data (today’s stock price, current order status) — use a workflow tool instead
Tiny content (a 2-paragraph product description) — just put it in the prompt

How it works

1. Add sources         (PDF, TXT, CSV, JSON, or MD file, or a URL)
2. Convert to markdown (PyMuPDF for PDFs; crawl + convert for URLs)
3. Chunk               (default: fixed 1000-character chunks)
4. Embed               (OpenAI text-embedding-3-small or -large)
5. Store               (PostgreSQL with pgvector)
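To make the default chunking step concrete, here is a minimal sketch of fixed-size splitting. It assumes chunks are consecutive, non-overlapping 1000-character slices; this page does not say whether the platform overlaps chunks or respects word boundaries, and `chunk_text` is a hypothetical helper, not a platform function.

```python
def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive slices of at most chunk_size characters.

    Assumption: no overlap between chunks and no attempt to split on
    sentence or word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 2500-character document yields two full chunks and one partial chunk.
markdown = "x" * 2500
chunks = chunk_text(markdown)
print([len(c) for c in chunks])
```

Under this assumption the final chunk is simply whatever is left over, so short documents produce a single chunk shorter than 1000 characters.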

At query time:
Embed the query turn
Vector similarity search
Return top_n chunks (default 5)
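The query-time steps can be illustrated with a small in-memory ranking sketch. It assumes cosine similarity as the metric (this page only says "vector similarity search") and uses hand-written 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_chunks` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vec, stored, top_n=5):
    """Return the top_n (score, chunk_text) pairs, best match first.

    `stored` is a list of (chunk_text, vector) pairs standing in for the
    rows a vector store would hold.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Toy vectors, not real embeddings.
stored = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("returns process", [0.9, 0.1, 0.0]),
]
for score, text in top_chunks([1.0, 0.0, 0.0], stored, top_n=2):
    print(f"{score:.3f}  {text}")
```

In a real deployment the database performs this ranking; the sketch only shows what "return the top_n chunks" means for a single query vector.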

You don’t have to think about steps 2–5: the platform runs them automatically when you add a source. You can track a source’s progress through its pipeline status: pending → converting_to_markdown → chunking → embedding → ready.
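If you poll for completion, the status sequence above can be treated as an ordered list. The sketch below is one way to wait for `ready`; `fetch_status` is a hypothetical callable that you would implement against the API, and the code assumes any status outside the five listed here (for example, a failure state, which this page does not describe) should stop the wait.

```python
import time
from typing import Callable

# Status order as listed on this page.
PIPELINE_STATUSES = (
    "pending",
    "converting_to_markdown",
    "chunking",
    "embedding",
    "ready",
)


def progress(status: str) -> float:
    """Fraction of the pipeline completed, from 0.0 (pending) to 1.0 (ready)."""
    return PIPELINE_STATUSES.index(status) / (len(PIPELINE_STATUSES) - 1)


def wait_until_ready(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> None:
    """Poll until the status is 'ready', or raise on timeout or unknown status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status not in PIPELINE_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        if status == "ready":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

The poll interval and timeout are arbitrary illustration values; pick ones that suit the size of your sources.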

Datasets and sources

The dashboard says “Knowledge Base,” but the API and database use three related nouns:

Term	What it is
Dataset	A knowledge base. Identified by a `dataset_id`.
Source	A file or URL you ingested. Has its own upload state.
Dataset source	The attachment of a source to a KB. Carries its own per-KB processing status (`converting_to_markdown → chunking → embedding → ready`) and a `dataset_source_id` you use to detach it.

The same source can be attached to several knowledge bases.
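The relationship between the three terms can be pictured as a join record between datasets and sources. The following is an illustrative in-memory model only, not the platform's schema: `source_id`, the `dsrc_` ID format, the initial `pending` status, and `AttachmentRegistry` are all assumptions made for the example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class DatasetSource:
    """One attachment of a source to a knowledge base (dataset)."""
    dataset_source_id: str
    dataset_id: str
    source_id: str           # hypothetical field name
    status: str = "pending"  # assumed initial per-KB processing status


class AttachmentRegistry:
    """Toy in-memory stand-in for the attachment records."""

    def __init__(self) -> None:
        self._attachments: dict[str, DatasetSource] = {}
        self._ids = itertools.count(1)

    def attach(self, dataset_id: str, source_id: str) -> DatasetSource:
        attachment = DatasetSource(f"dsrc_{next(self._ids)}", dataset_id, source_id)
        self._attachments[attachment.dataset_source_id] = attachment
        return attachment

    def detach(self, dataset_source_id: str) -> None:
        # Detaching removes only this attachment; the source itself remains.
        del self._attachments[dataset_source_id]

    def datasets_for(self, source_id: str) -> list[str]:
        return sorted(
            a.dataset_id for a in self._attachments.values() if a.source_id == source_id
        )


registry = AttachmentRegistry()
support = registry.attach("kb_support", "src_manual")
sales = registry.attach("kb_sales", "src_manual")
print(registry.datasets_for("src_manual"))  # the same source in two KBs
registry.detach(sales.dataset_source_id)
print(registry.datasets_for("src_manual"))  # still attached to kb_support
```

The point of the model is that detaching works on the `dataset_source_id`, so removing a source from one knowledge base leaves its other attachments untouched.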

Model choices

Pick an embedding model and dimension when you create the KB:

Model	Dimensions	Use when
`text-embedding-3-small`	768 (default)	Most use cases. Faster, cheaper.
`text-embedding-3-large`	1536 or 3072	Long-tail factual recall, or technical content where the small model misses nuance.

You can change the model later, but doing so requires re-embedding every source.
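If you create knowledge bases programmatically, it can help to check the model and dimension pairing before sending a request. This sketch encodes only the combinations listed in the table above; `validate_embedding_config` is a hypothetical helper, and treating `text-embedding-3-small` at 768 dimensions as the overall default, and requiring an explicit dimension for the large model, are assumptions.

```python
from typing import Optional

# Combinations taken from the table on this page.
ALLOWED_DIMENSIONS = {
    "text-embedding-3-small": (768,),
    "text-embedding-3-large": (1536, 3072),
}
DEFAULT_MODEL = "text-embedding-3-small"  # assumed overall default


def validate_embedding_config(
    model: str = DEFAULT_MODEL, dimensions: Optional[int] = None
) -> tuple[str, int]:
    """Return a (model, dimensions) pair or raise ValueError if unsupported."""
    if model not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown embedding model: {model!r}")
    allowed = ALLOWED_DIMENSIONS[model]
    if dimensions is None:
        if len(allowed) != 1:
            # Assumption: no default is documented when several sizes exist.
            raise ValueError(f"choose one of {allowed} for {model}")
        dimensions = allowed[0]
    if dimensions not in allowed:
        raise ValueError(f"{model} supports {allowed}, got {dimensions}")
    return model, dimensions


print(validate_embedding_config())
print(validate_embedding_config("text-embedding-3-large", 3072))
```

Validating up front matters because a wrong choice is expensive to undo: changing the model later means re-embedding every source.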

Supported content

File types: .pdf, .txt, .csv, .json, .md. Other types (including .html) are rejected — ingest web content as a URL source instead.
URLs: crawled, optionally following links. See Crawling URLs.
Sources per KB: governed by your plan.
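A client-side check can catch unsupported files before upload. The sketch below uses the extension list above; matching case-insensitively is an assumption, and `is_supported_file` is a hypothetical helper rather than part of the platform.

```python
from pathlib import PurePosixPath

# File types listed on this page.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def is_supported_file(filename: str) -> bool:
    """True if the file's final extension is in the supported list.

    Assumption: extensions are compared case-insensitively.
    """
    return PurePosixPath(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["manual.PDF", "faq.md", "page.html", "README"]:
    print(name, is_supported_file(name))
```

Files that fail the check, such as `.html` pages, are better ingested as URL sources.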
