A knowledge base (KB) is a vector index of your content — uploaded files or crawled URLs — that agents query in real time to answer factual questions.

When to use one

Use a KB when:
  • Callers ask factual questions whose answers live in static documents (FAQs, manuals, policy pages)
  • The content is too long to fit in the agent’s system prompt
  • The content changes infrequently (once a day at most)
Don’t use a KB for:
  • Per-call state (the caller’s name, prior answers in this call) — that lives in conversation context
  • Real-time data (today’s stock price, current order status) — use a workflow tool instead
  • Tiny content (a 2-paragraph product description) — just put it in the prompt

How it works

1. Add sources         (PDF, TXT, CSV, JSON, or MD file, or a URL)
2. Convert to markdown  (PyMuPDF for PDFs; crawl + convert for URLs)
3. Chunk                (default: fixed 1000-character chunks)
4. Embed                (OpenAI text-embedding-3-small or -large)
5. Store                (PostgreSQL with pgvector)

At query time:
6. Embed the query turn
7. Vector similarity search
8. Return top_n chunks (default 5)
You don’t have to manage steps 2–5 yourself; the platform runs them automatically when you add a source. You can track a source’s progress through its pipeline status: pending → converting_to_markdown → chunking → embedding → ready.
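
To make the chunking and retrieval steps concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the platform's implementation: the function names are hypothetical, chunks are assumed not to overlap, cosine similarity is assumed as the similarity measure, and the tiny hand-written vectors stand in for real embeddings.

```python
import math


def chunk_text(text: str, size: int = 1000) -> list[str]:
    """Split text into consecutive fixed-size character chunks.

    Assumes no overlap between chunks; the last chunk may be shorter.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_chunks(query_vec, indexed, top_n: int = 5) -> list[str]:
    """Return the text of the top_n chunks most similar to the query vector.

    `indexed` is a list of (chunk_text, vector) pairs, standing in for the
    rows a vector store would hold.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_n]]
```

For example, `chunk_text("a" * 2500)` yields three chunks of 1000, 1000, and 500 characters. In a real deployment the vector store performs the ranking, so the linear scan here only shows the idea.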

Datasets and sources

The dashboard says “Knowledge Base,” but the API and database use three related nouns:
| Term | What it is |
| --- | --- |
| Dataset | A knowledge base. Identified by a dataset_id. |
| Source | A file or URL you ingested. Has its own upload state. |
| Dataset source | The attachment of a source to a KB. Carries its own per-KB processing status (converting_to_markdown → chunking → embedding → ready) and a dataset_source_id you use to detach it. |
The same source can be attached to several knowledge bases.
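
The relationship between the three nouns can be sketched as a small in-memory model. This is only an illustration: the class layout, the `source_id` and `location` fields, the `attach` and `detach` helpers, and the ID format are all hypothetical, and only `dataset_id`, `dataset_source_id`, and the status names come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    CONVERTING_TO_MARKDOWN = "converting_to_markdown"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    READY = "ready"


@dataclass
class Source:
    source_id: str  # hypothetical field name
    location: str   # file name or URL (hypothetical field name)


@dataclass
class DatasetSource:
    """One attachment of a source to a knowledge base, with its own status."""
    dataset_source_id: str
    dataset_id: str
    source_id: str
    status: Status = Status.PENDING


@dataclass
class Dataset:
    dataset_id: str
    attachments: list[DatasetSource] = field(default_factory=list)

    def attach(self, source: Source) -> DatasetSource:
        # Hypothetical ID scheme; real IDs are assigned by the platform.
        link = DatasetSource(
            dataset_source_id=f"{self.dataset_id}:{source.source_id}",
            dataset_id=self.dataset_id,
            source_id=source.source_id,
        )
        self.attachments.append(link)
        return link

    def detach(self, dataset_source_id: str) -> None:
        # Removes only this KB's attachment; the source itself is untouched.
        self.attachments = [
            a for a in self.attachments
            if a.dataset_source_id != dataset_source_id
        ]
```

Attaching one `Source` to two `Dataset` objects produces two `DatasetSource` records with different IDs and independent statuses, which is why detaching uses the `dataset_source_id` rather than the source's own identifier.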

Model choices

Pick an embedding model and dimension when you create the KB:
| Model | Dimensions | Use when |
| --- | --- | --- |
| text-embedding-3-small | 768 (default) | Most use cases. Faster, cheaper. |
| text-embedding-3-large | 1536 or 3072 | Long-tail factual recall, technical content where the small model misses nuance |
You can change the model later, but doing so requires re-embedding every source.

Supported content

  • File types: .pdf, .txt, .csv, .json, .md. Other types (including .html) are rejected — ingest web content as a URL source instead.
  • URLs: crawled, optionally following links. See Crawling URLs.
  • Sources per KB: governed by your plan.
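
A client-side pre-check can catch unsupported files before upload. The sketch below is a hypothetical helper, not part of the platform; it assumes the extension check is case-insensitive, which the text above does not confirm.

```python
from pathlib import PurePath

# File types listed as supported in this section.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def is_supported_file(filename: str) -> bool:
    """Return True if the file's final extension is one of the supported types.

    Assumption: matching ignores case, so "NOTES.MD" counts as ".md".
    """
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

With this helper, `is_supported_file("manual.pdf")` is true and `is_supported_file("page.html")` is false; HTML content would go in as a URL source instead.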