Source types
| Type | Format | Notes |
|---|---|---|
| File | .pdf, .txt, .csv, .json, .md | Parsed to markdown on upload |
| URL | Any HTTP(S) URL | Crawled — optionally following links. See Crawling URLs |
Only .pdf, .txt, .csv, .json, and .md files are supported. Other file types (including .html) are rejected. To ingest web content, add it as a URL source instead so it is crawled and converted.
Upload a file
Set name and description (optional)
The filename is used as the default name. The description is shown in the KB UI but doesn’t affect retrieval.
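A pre-upload check can catch unsupported files before they are rejected. The following is a minimal sketch, not part of the product: `check_upload` is a hypothetical helper, and comparing extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions listed in the "Source types" table.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def check_upload(path: str) -> str:
    """Return the default source name for a file, or raise if its type is unsupported."""
    p = Path(path)
    ext = p.suffix.lower()  # assumption: extension matching ignores case
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"{ext or 'a file with no extension'} is not supported; "
            "add web content as a URL source instead"
        )
    # The filename is used as the default name.
    return p.name
```

For example, `check_upload("docs/guide.md")` returns `"guide.md"`, while `check_upload("page.html")` raises `ValueError`.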
Add a URL
URL sources are crawled at ingestion time. By default only the exact URL is fetched; enable crawling to follow links up to a depth and page limit. See Crawling URLs for every crawl option.
Configure crawl options (optional)
Enable crawling to follow links up to a max depth and a max page count, with include/exclude path filters.
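To see how a depth limit, a page limit, and path filters interact, it can help to model the crawl as a breadth-first walk. The sketch below is illustrative only and is not the product's crawler: `plan_crawl`, its parameter names, the glob-style filter syntax, and the rule that the start URL is always fetched are all assumptions. It works on an in-memory link map instead of fetching pages.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def plan_crawl(start, links, max_depth=1, max_pages=10, include=("*",), exclude=()):
    """Breadth-first walk over `links` (url -> list of linked urls).

    Returns the URLs that would be fetched, in visit order.
    """
    def allowed(url):
        path = urlparse(url).path or "/"
        included = any(fnmatchcase(path, pat) for pat in include)
        excluded = any(fnmatchcase(path, pat) for pat in exclude)
        return included and not excluded

    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for nxt in links.get(url, []):
            if nxt not in seen and allowed(nxt):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return fetched


# Hypothetical site: /docs links to two docs pages and one blog post.
site = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/blog/x",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/c"],
}
pages = plan_crawl("https://example.com/docs", site, max_depth=1, include=("/docs*",))
```

With `max_depth=0` only the start URL is returned, which mirrors the default of fetching the exact URL only.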
Pattern-based URL ingestion
For sites with predictable URL structures (e.g. /help/article/{slug}), use Add source → URL pattern to add many pages at once. See Knowledge Bases API usage.
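Before attaching a pattern, it may be useful to check which of a site's URLs actually fit the structure. The sketch below is a local sanity check only. How the product resolves a pattern is not described here, so `pattern_to_regex` and `matching_urls` are hypothetical helpers, and treating each `{name}` placeholder as exactly one path segment is an assumption.

```python
import re
from urllib.parse import urlparse


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '/help/article/{slug}' into a regex; each {name} matches one path segment."""
    # re.split with a capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    body = ""
    for i, part in enumerate(parts):
        body += f"(?P<{part}>[^/]+)" if i % 2 else re.escape(part)
    return re.compile("^" + body + "/?$")


def matching_urls(pattern: str, urls: list[str]) -> list[str]:
    """Return the URLs whose path fits the pattern."""
    rx = pattern_to_regex(pattern)
    return [u for u in urls if rx.match(urlparse(u).path)]


candidates = [
    "https://example.com/help/article/reset-password",
    "https://example.com/help/",
    "https://example.com/help/article/billing/extra",
]
fits = matching_urls("/help/article/{slug}", candidates)
```

Here only the first candidate fits, because the other two have too few or too many path segments.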
Source status
After upload, a source is in one of these states:

| Status | Meaning |
|---|---|
| pending | Queued for processing |
| converting_to_markdown | Extracting text (PDF) or converting crawled HTML |
| chunking | Splitting into retrievable units |
| embedding | Computing vectors |
| ready | Queryable |
| failed | Inspect the error and retry |

Common causes of failure:
- Scanned PDFs with no text layer (PyMuPDF can’t OCR)
- URLs behind authentication
- File encoding issues for .txt files
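Because a source passes through several intermediate states before it becomes queryable, client code usually waits for a terminal state. The following is a minimal polling sketch under stated assumptions: `get_status` is a hypothetical callable that returns the current status string (how you obtain it is up to your integration), and the poll interval and timeout are arbitrary defaults.

```python
import time

# Status values from the table above.
IN_PROGRESS = {"pending", "converting_to_markdown", "chunking", "embedding"}
TERMINAL = {"ready", "failed"}


def wait_until_done(get_status, poll_seconds=2.0, timeout_seconds=300.0, sleep=time.sleep):
    """Poll `get_status()` until it returns 'ready' or 'failed'.

    `get_status` is a hypothetical zero-argument callable supplied by the caller.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if status not in IN_PROGRESS:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

A caller would treat `"ready"` as safe to query and `"failed"` as a signal to inspect the error and retry.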
Refreshing content
Source content is frozen at ingestion. Updating a page or replacing a file doesn’t update the knowledge base, and there is no in-place “refresh.” To refresh content, delete the source and add it again. For frequently-changing sites, re-run the crawl by re-adding the URL (or re-run a URL-pattern attach). See Managing content.
Best practices
- One topic per KB. Don’t mix product docs and HR policies in one KB.
- Smaller is better. A KB of 50 high-quality chunks outperforms one with 5,000 noisy chunks.
- Strip navigation chrome from web sources. Headers, footers, and sidebars pollute retrieval. If you control the source pages, render a clean version.
- Markdown is gold. If you can convert content to markdown ahead of time (e.g. from Notion or Confluence exports), retrieval quality is noticeably better than from PDFs.
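If you control the source pages, one way to produce a clean version is to drop navigation chrome before publishing or converting the page. The sketch below uses only the Python standard library and assumes the chrome lives in semantic `<nav>`, `<header>`, `<footer>`, and `<aside>` elements, which will not hold for every site. `ChromeStripper` and `visible_text` are hypothetical names, and the parser does not try to recover from badly nested HTML.

```python
from html.parser import HTMLParser

# Elements treated as chrome or non-content (an assumption about the page markup).
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ChromeStripper(HTMLParser):
    """Collect text that sits outside any skipped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the page's text content, one text node per line, without chrome."""
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<header><nav><a href='/'>Home</a></nav></header>"
    "<main><h1>Reset password</h1><p>Go to Settings.</p></main>"
    "<footer>Example Inc.</footer>"
)
clean = visible_text(page)
```

The result keeps the heading and body text and drops the header, navigation, and footer text.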

