# Knowledge bases overview

> Ground your agents in your own content with retrieval-augmented generation.

A **knowledge base (KB)** is a vector index of your content — uploaded files or crawled URLs — that agents query in real time to answer factual questions.

## When to use one

Use a KB when:

* Callers ask factual questions whose answers live in static documents (FAQs, manuals, policy pages)
* The content is too long to fit in the agent's system prompt
* The content changes infrequently: once a day or less often

Don't use a KB for:

* Per-call state (the caller's name, prior answers in this call) — that lives in conversation context
* Real-time data (today's stock price, current order status) — use a workflow tool instead
* Tiny content (a 2-paragraph product description) — just put it in the prompt

## How it works

```
1. Add sources         (PDF, TXT, CSV, JSON, or MD file, or a URL)
2. Convert to markdown  (PyMuPDF for PDFs; crawl + convert for URLs)
3. Chunk                (default: fixed 1000-character chunks)
4. Embed                (OpenAI text-embedding-3-small or -large)
5. Store                (PostgreSQL with pgvector)

At query time:
6. Embed the query turn
7. Vector similarity search
8. Return top_n chunks (default 5)
```
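The chunking and query-time steps above can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: `toy_embed` is a hypothetical stand-in for the real embedding model, the in-memory list stands in for pgvector, and cosine similarity is an assumed metric, since the steps above only say "vector similarity search".

```python
import math

CHUNK_SIZE = 1000  # characters, the documented default chunk size
TOP_N = 5          # documented default number of chunks returned


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: a 26-dimensional letter-frequency vector.

    A real pipeline would call the embedding model here instead.
    """
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query: str, chunks: list[str], top_n: int = TOP_N) -> list[str]:
    """Embed the query, rank chunks by similarity, and return the best top_n."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:top_n]
```

In a real deployment the chunk embeddings are computed once at ingestion and stored, so only the query is embedded at call time; the sketch re-embeds chunks on every query purely to stay short.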

You don't have to manage steps 2–5: the platform runs them automatically when you add a source. You can track a source's progress through its pipeline status: `pending → converting_to_markdown → chunking → embedding → ready`.
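If you display ingestion progress in your own tooling, the status sequence can be treated as an ordered list. The sketch below is a minimal example that assumes the five statuses named above are the only ones; the helper names are hypothetical, and any status not listed here (for example, a failure state) raises a `ValueError` instead of being guessed at.

```python
# Ordered pipeline statuses, exactly as listed in this page.
PIPELINE = ["pending", "converting_to_markdown", "chunking", "embedding", "ready"]


def progress(status: str) -> float:
    """Return the fraction of the pipeline completed, from 0.0 to 1.0.

    Raises ValueError for a status that is not in PIPELINE.
    """
    return PIPELINE.index(status) / (len(PIPELINE) - 1)


def is_ready(status: str) -> bool:
    """True once the source has finished processing and can be queried."""
    return status == "ready"
```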

## Datasets and sources

The dashboard says "Knowledge Base," but the API and database use three related nouns:

| Term               | What it is                                                                                                                                                                             |
| ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Dataset**        | A knowledge base. Identified by a `dataset_id`.                                                                                                                                        |
| **Source**         | A file or URL you ingested. Has its own upload state.                                                                                                                                  |
| **Dataset source** | The attachment of a source to a KB. Carries its own per-KB processing status (`converting_to_markdown → chunking → embedding → ready`) and a `dataset_source_id` you use to detach it. |

The same source can be attached to several knowledge bases.
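One way to picture the relationship is as three small record types. This is a sketch of the concepts only, not the API's actual schema: `source_id`, `location`, the initial `pending` status, and the `attach`/`detach` helpers are assumptions made for illustration, while `dataset_id` and `dataset_source_id` come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    source_id: str  # hypothetical identifier for the ingested file or URL
    location: str   # hypothetical: a file name or URL


@dataclass
class DatasetSource:
    dataset_source_id: str  # used to detach the source from this KB
    dataset_id: str
    source_id: str
    status: str = "pending"  # per-KB processing status; initial value is assumed


@dataclass
class Dataset:
    dataset_id: str
    # Attachments keyed by dataset_source_id.
    attachments: dict[str, DatasetSource] = field(default_factory=dict)


def attach(dataset: Dataset, source: Source, dataset_source_id: str) -> DatasetSource:
    """Attach a source to a KB, creating a separate per-KB link record."""
    link = DatasetSource(dataset_source_id, dataset.dataset_id, source.source_id)
    dataset.attachments[dataset_source_id] = link
    return link


def detach(dataset: Dataset, dataset_source_id: str) -> None:
    """Remove one attachment; the source and its other attachments are untouched."""
    del dataset.attachments[dataset_source_id]
```

Because each attachment is its own record, detaching a source from one knowledge base leaves it attached, with its own status, in every other knowledge base.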

## Model choices

Pick an embedding model and dimension when you create the KB:

| Model                    | Dimensions    | Use when                                                                        |
| ------------------------ | ------------- | ------------------------------------------------------------------------------- |
| `text-embedding-3-small` | 768 (default) | Most use cases. Faster, cheaper.                                                |
| `text-embedding-3-large` | 1536 or 3072  | Long-tail factual recall, technical content where the small model misses nuance |

You can change the model later, but doing so requires re-embedding every source.

## Supported content

* **File types**: `.pdf`, `.txt`, `.csv`, `.json`, `.md`. Other types (including `.html`) are rejected — ingest web content as a **URL** source instead.
* **URLs**: crawled, optionally following links. See [Crawling URLs](/knowledge-bases/crawling-urls).
* **Sources per KB**: governed by your plan.
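
To avoid a rejected upload, you can filter files by extension before sending them. This is a minimal sketch based on the file types listed above; `is_accepted_file` is a hypothetical helper, and matching extensions case-insensitively is an assumption about how the platform behaves.

```python
from pathlib import PurePath

# File extensions listed as supported in this section.
ACCEPTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def is_accepted_file(filename: str) -> bool:
    """Return True if the file's extension is one of the supported types."""
    return PurePath(filename).suffix.lower() in ACCEPTED_EXTENSIONS
```

A file such as `pricing.html` fails this check and should be ingested as a URL source instead.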
