Sources are the units of content in a knowledge base. Each source goes through an ingestion pipeline before it’s queryable.

Source types

Type | Format | Notes
File | .pdf, .txt, .csv, .json, .md | Parsed to markdown on upload
URL | Any HTTP(S) URL | Crawled, optionally following links. See Crawling URLs
Only .pdf, .txt, .csv, .json, and .md files are supported. Other file types (including .html) are rejected. To ingest web content, add it as a URL source instead so it is crawled and converted.
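
Because unsupported files are rejected, it can help to filter a batch locally before uploading. The sketch below is a minimal, client-side illustration only: the function names are hypothetical, and the case-insensitive match is an assumption, since this page does not say how the platform treats extensions such as `.PDF`.

```python
from pathlib import PurePath

# Extensions listed as supported on this page.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def is_supported_file(filename: str) -> bool:
    """Return True if the filename ends in a supported extension.

    Assumption: matching is case-insensitive. The platform's own
    check may be stricter.
    """
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Partition filenames into (uploadable, rejected) lists."""
    uploadable = [f for f in filenames if is_supported_file(f)]
    rejected = [f for f in filenames if not is_supported_file(f)]
    return uploadable, rejected
```

Anything in the rejected list that is web content, such as a saved .html page, is a candidate for a URL source instead.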

Upload a file

1. Open the KB

Knowledge Bases → click your KB.

2. Add source

Click Add source → File.

3. Drop in your file

Drag and drop the file, or browse for it. You can add several files at once.

4. Set name and description (optional)

The filename is used as the default name. The description is shown in the KB UI but doesn’t affect retrieval.

5. Wait for processing

The status moves through pending → converting_to_markdown → chunking → embedding → ready. PDFs take the longest to convert because their text is extracted with PyMuPDF.
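
If you track processing from a script rather than the UI, the status flow above lends itself to a simple polling loop. The following is a sketch under stated assumptions: `get_status` is a hypothetical callable that you supply (this page does not document an endpoint for reading status), and the interval and timeout defaults are arbitrary. The `failed` state comes from the Source status table.

```python
import time
from typing import Callable

# Status values documented on this page.
PIPELINE = ["pending", "converting_to_markdown", "chunking", "embedding", "ready"]
TERMINAL = {"ready", "failed"}


def wait_until_done(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the source is ready or failed.

    get_status is a hypothetical, caller-supplied function that returns
    the current status string for one source.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if status not in PIPELINE:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.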

Add a URL

URL sources are crawled at ingestion time. By default only the exact URL is fetched; enable crawling to follow links up to a depth and page limit. See Crawling URLs for every crawl option.
1. Add source → URL

Paste the URL.

2. Configure crawl options (optional)

Enable crawling to follow links up to a max depth and a max page count, with include/exclude path filters.

3. Wait for processing

Crawling is asynchronous: the source stays in progress until the crawl completes.
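
To build intuition for how a depth limit, a page limit, and path filters interact, here is an offline sketch of a breadth-first crawl plan. It is not the product's crawler. The function names, the prefix-matching filter semantics, the choice to always fetch the start URL, and counting the start URL toward the page limit are all assumptions made for illustration; see Crawling URLs for the actual behavior.

```python
from collections import deque
from urllib.parse import urlparse


def path_allowed(url, include, exclude):
    """Apply include/exclude filters as path prefixes (assumed semantics)."""
    path = urlparse(url).path or "/"
    if any(path.startswith(p) for p in exclude):
        return False
    return not include or any(path.startswith(p) for p in include)


def plan_crawl(start_url, links, max_depth, max_pages, include=(), exclude=()):
    """Return the URLs a breadth-first crawl would visit, in order.

    links maps each URL to the URLs it links to. It stands in for real
    fetching so the sketch runs offline.
    """
    visited = [start_url]          # the start URL is always fetched
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if depth >= max_depth:     # do not follow links past the depth limit
            continue
        for nxt in links.get(url, []):
            if len(visited) >= max_pages:
                break
            if nxt in seen or not path_allowed(nxt, include, exclude):
                continue
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

With `max_depth=0` the plan contains only the start URL, which mirrors the default of fetching the exact URL and nothing else.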

Pattern-based URL ingestion

For sites with predictable URL structures (e.g. /help/article/{slug}), use Add source → URL pattern to add many pages at once. See Knowledge Bases API usage.
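
As an illustration of what a predictable URL structure means, the sketch below expands a `{slug}` template into concrete URLs. It only shows the idea: `expand_pattern`, the example host, and the slugs are hypothetical, and the pattern syntax the product actually accepts is described under Knowledge Bases API usage, not here.

```python
from urllib.parse import quote, urljoin


def expand_pattern(base_url: str, pattern: str, slugs):
    """Fill the {slug} placeholder for each slug and join with the base URL."""
    urls = []
    for slug in slugs:
        # Percent-encode the slug so spaces and slashes stay inside one path segment.
        path = pattern.replace("{slug}", quote(slug, safe=""))
        urls.append(urljoin(base_url, path))
    return urls


# Hypothetical site and slugs, for illustration only.
urls = expand_pattern(
    "https://example.com",
    "/help/article/{slug}",
    ["reset-password", "billing faq"],
)
```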

Source status

After upload, a source is in one of these states:
Status | Meaning
pending | Queued for processing
converting_to_markdown | Extracting text (PDF) or converting crawled HTML
chunking | Splitting into retrievable units
embedding | Computing vectors
ready | Queryable
failed | Inspect the error and retry
Failure is most often caused by:
  • Scanned PDFs with no text layer (the PyMuPDF extraction step does not run OCR)
  • URLs behind authentication
  • Encoding problems in .txt files
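
Encoding problems in .txt files can often be caught before upload. The sketch below assumes the pipeline expects UTF-8, which this page does not state, so treat it as a local sanity check and not a guarantee. Both function names are hypothetical, and the source encoding passed to `transcode_to_utf8` must be known by the caller, since it cannot be detected reliably from the bytes alone.

```python
def check_utf8(data: bytes):
    """Return (ok, message) for whether the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"invalid UTF-8 at byte offset {exc.start}"
    return True, "ok"


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")
```

A file that fails the check can be re-encoded with `transcode_to_utf8` and uploaded again.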

Refreshing content

Source content is frozen at ingestion. Changing the original page or file afterwards doesn’t update the knowledge base, and there is no in-place “refresh.” To refresh content, delete the source and add it again. For frequently changing sites, re-run the crawl by re-adding the URL (or re-run a URL-pattern attach). See Managing content.

Best practices

  • One topic per KB. Don’t mix product docs and HR policies in one KB.
  • Smaller is better. A KB of 50 high-quality chunks outperforms one with 5,000 noisy chunks.
  • Strip navigation chrome from web sources. Headers, footers, and sidebars pollute retrieval. If you control the source pages, render a clean version.
  • Markdown is gold. If you can convert content to markdown ahead of time (e.g. from Notion or Confluence exports), retrieval quality is noticeably better than from PDFs.