Uploading documents

Sources are the units of content in a knowledge base. Each source goes through an ingestion pipeline before it’s queryable.

Source types

Type	Format	Notes
File	`.pdf`, `.txt`, `.csv`, `.json`, `.md`	Parsed to markdown on upload
URL	Any HTTP(S) URL	Crawled, optionally following links. See Crawling URLs

Only .pdf, .txt, .csv, .json, and .md files are supported. Other file types (including .html) are rejected. To ingest web content, add it as a URL source instead so it is crawled and converted.
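If you prepare uploads with a script, a small local check can catch unsupported files before they are rejected. The sketch below only mirrors the extension list above; the function names are illustrative, and treating extensions as case-insensitive is an assumption, not documented behavior.

```python
from pathlib import Path

# Extensions listed in the Source types table.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def is_supported_file(filename: str) -> bool:
    """Return True if the filename ends in a supported extension.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_uploads(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Partition filenames into (accepted, rejected) by extension."""
    accepted = [f for f in filenames if is_supported_file(f)]
    rejected = [f for f in filenames if not is_supported_file(f)]
    return accepted, rejected
```

Anything in the rejected list that is web content (for example an .html export) is a candidate for a URL source instead.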

Upload a file

Open the KB

Knowledge Bases → click your KB.

Add source

Click Add source → File.

Drop in your file

Drag and drop your file, or browse for it. You can add multiple files at once.

Set name and description (optional)

The filename is used as the default name. The description is shown in the KB UI but doesn’t affect retrieval.

Wait for processing

Status flows: pending → converting_to_markdown → chunking → embedding → ready. PDFs take the longest to convert because their text is extracted with PyMuPDF.

Add a URL

URL sources are crawled at ingestion time. By default only the exact URL is fetched; enable crawling to follow links up to a depth and page limit. See Crawling URLs for every crawl option.

Add source → URL

Paste the URL.

Configure crawl options (optional)

Enable crawling to follow links up to a max depth and a max page count, with include/exclude path filters.

Wait for processing

Crawling is asynchronous: the source stays in progress until the crawl completes.
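To build intuition for how a depth limit, a page limit, and path filters interact, here is a minimal sketch of a bounded breadth-first crawl over an in-memory link graph. It is not the product's crawler: the traversal order, the prefix-based filters, and all function and parameter names are assumptions made for illustration, and the `links` dictionary stands in for actually fetching pages.

```python
from collections import deque
from urllib.parse import urlparse


def path_allowed(url: str, include: tuple[str, ...], exclude: tuple[str, ...]) -> bool:
    """Apply exclude prefixes first, then include prefixes (empty include allows all)."""
    path = urlparse(url).path or "/"
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    return not include or any(path.startswith(prefix) for prefix in include)


def crawl(start_url, links, max_depth=1, max_pages=10, include=(), exclude=()):
    """Return the URLs visited, starting URL first.

    links: dict mapping a URL to the URLs it links to (hypothetical stand-in
    for fetching and parsing each page).
    """
    visited = [start_url]          # the exact URL is always fetched
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:     # do not follow links past the depth limit
            continue
        for nxt in links.get(url, []):
            if len(visited) >= max_pages:
                return visited     # page limit reached
            if nxt in seen or not path_allowed(nxt, include, exclude):
                continue
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

With `max_depth=0` the sketch returns only the starting URL, which matches the default of fetching the exact URL without following links.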

Pattern-based URL ingestion

For sites with predictable URL structures (e.g. /help/article/{slug}), use Add source → URL pattern to add many pages at once. See Knowledge Bases API usage.
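As an illustration of what a pattern such as /help/article/{slug} stands for, the sketch below expands one template into concrete URLs from a list of slugs. How the product itself interprets patterns is not described on this page, so `expand_pattern`, the base URL, and the slugs are hypothetical.

```python
from urllib.parse import quote, urljoin


def expand_pattern(base_url: str, pattern: str, slugs: list[str]) -> list[str]:
    """Fill the {slug} placeholder once per slug and resolve against base_url."""
    urls = []
    for slug in slugs:
        # Percent-encode the slug so spaces and slashes cannot alter the path.
        path = pattern.replace("{slug}", quote(slug, safe=""))
        urls.append(urljoin(base_url, path))
    return urls
```

For example, `expand_pattern("https://example.com", "/help/article/{slug}", ["reset-password", "billing"])` yields one URL per article.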

Source status

After you add a source, it is in one of these states:

Status	Meaning
`pending`	Queued for processing
`converting_to_markdown`	Extracting text (PDF) or converting crawled HTML
`chunking`	Splitting into retrievable units
`embedding`	Computing vectors
`ready`	Queryable
`failed`	Processing failed; inspect the error and retry
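If you track sources from a script, the states above can be modeled as an enum with `ready` and `failed` as the terminal states. The sketch below is a hedged example: this page does not document a status endpoint, so `get_status` is a hypothetical callable you would supply, and the poll interval and timeout are arbitrary defaults.

```python
import time
from enum import Enum
from typing import Callable


class SourceStatus(str, Enum):
    PENDING = "pending"
    CONVERTING_TO_MARKDOWN = "converting_to_markdown"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    READY = "ready"
    FAILED = "failed"


# A source stops changing state once it is ready or failed.
TERMINAL = {SourceStatus.READY, SourceStatus.FAILED}


def wait_for_source(
    get_status: Callable[[], str],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
) -> SourceStatus:
    """Poll get_status() until the source is ready or failed, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = SourceStatus(get_status())  # raises ValueError on unknown states
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still {status.value} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

A caller would treat a returned `SourceStatus.FAILED` as the signal to inspect the error and retry.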

Failure is most often caused by:

Scanned PDFs with no text layer (PyMuPDF can’t OCR)
URLs behind authentication
File encoding issues for .txt
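For the .txt case, one way to reduce encoding failures is to normalize files before upload. The sketch below assumes UTF-8 is a safe target encoding, which this page does not state, and uses Windows-1252 as an example fallback; both function names are illustrative.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Return the content as UTF-8 bytes.

    Valid UTF-8 is returned unchanged. Otherwise the bytes are decoded with
    fallback_encoding (an assumption about the file's origin) and re-encoded.
    A UnicodeDecodeError propagates if the fallback does not fit either.
    """
    if decodes_as_utf8(data):
        return data
    return data.decode(fallback_encoding).encode("utf-8")
```

Choosing the wrong fallback produces garbled characters rather than an error, so it is worth spot-checking the converted text.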

Refreshing content

Source content is frozen at ingestion. Updating a page or replacing a file doesn’t update the knowledge base, and there is no in-place “refresh.” To refresh content, delete the source and add it again. For frequently-changing sites, re-run the crawl by re-adding the URL (or re-run a URL-pattern attach). See Managing content.
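Because refreshing means deleting and re-adding, it can help to know which sources have actually changed since ingestion. A minimal sketch, assuming you keep your own record of a content hash per source at the time you add it (the knowledge base is not described as exposing one, and the names here are hypothetical):

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the source content."""
    return hashlib.sha256(data).hexdigest()


def stale_sources(ingested: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Return names whose current content differs from what was ingested.

    ingested: source name -> fingerprint recorded when the source was added.
    current:  source name -> content as it exists now.
    """
    return sorted(
        name
        for name, data in current.items()
        if name in ingested and ingested[name] != content_fingerprint(data)
    )
```

Each name returned is a source to delete and add again; unchanged sources can be left alone.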

Best practices

One topic per KB. Don’t mix product docs and HR policies in one KB.
Smaller is better. A KB of 50 high-quality chunks typically outperforms one with 5,000 noisy chunks.
Strip navigation chrome from web sources. Headers, footers, and sidebars pollute retrieval. If you control the source pages, render a clean version.
Markdown is gold. If you can convert content to markdown ahead of time (e.g. from Notion or Confluence exports), retrieval quality is noticeably better than from PDFs.
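If you control the source pages and want to render a clean version, one approach is to drop chrome elements before publishing the page. The sketch below uses Python's standard `html.parser` and assumes the chrome lives in semantic tags (`nav`, `header`, `footer`, `aside`); pages that build navigation from plain `div` elements would need a different rule. The class and function names are illustrative.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as navigation chrome or non-content.
CHROME_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ChromeStripper(HTMLParser):
    """Collect text that is not nested inside any chrome tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside one or more chrome tags
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def strip_chrome(html: str) -> str:
    """Return the page text with chrome sections removed, one block per line."""
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The output is plain text rather than markdown; it is meant only to show which parts of a page are worth keeping.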
