# Uploading documents

> Add files and URLs to a knowledge base.

Sources are the units of content in a knowledge base. Each source goes through an ingestion pipeline before it's queryable.

## Source types

| Type     | Format                                 | Notes                                                                                     |
| -------- | -------------------------------------- | ----------------------------------------------------------------------------------------- |
| **File** | `.pdf`, `.txt`, `.csv`, `.json`, `.md` | Parsed to markdown on upload                                                              |
| **URL**  | Any HTTP(S) URL                        | Crawled — optionally following links. See [Crawling URLs](/knowledge-bases/crawling-urls) |

<Note>
  Only `.pdf`, `.txt`, `.csv`, `.json`, and `.md` files are supported. Other file types (including `.html`) are rejected. To ingest web content, add it as a **URL** source instead so it is crawled and converted.
</Note>
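
If you upload in bulk, a local pre-check can catch unsupported files before they are rejected. The sketch below only compares file extensions against the list above. The function name is illustrative, and case-insensitive matching is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed in the source-types table above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".md"}


def split_by_support(paths):
    """Return (accepted, rejected) lists based on file extension only."""
    accepted, rejected = [], []
    for p in paths:
        # Assumption: extensions are compared case-insensitively.
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS:
            accepted.append(p)
        else:
            rejected.append(p)
    return accepted, rejected


accepted, rejected = split_by_support(
    ["guide.pdf", "faq.md", "index.html", "notes.docx"]
)
print("upload:", accepted)
print("skip:", rejected)
```

Files that land in `rejected` need converting to a supported format first or, for web pages, adding as a URL source.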

## Upload a file

<Steps>
  <Step title="Open the KB">
    Go to **Knowledge Bases** and click your KB.
  </Step>

  <Step title="Add source">
    Click **Add source → File**.
  </Step>

  <Step title="Drop in your file">
    Drag and drop the file, or browse to select it. You can add multiple files at once.
  </Step>

  <Step title="Set name and description (optional)">
    The filename is used as the default name. The description is shown in the KB UI but doesn't affect retrieval.
  </Step>

  <Step title="Wait for processing">
    Status flows: `pending → converting_to_markdown → chunking → embedding → ready`. PDFs take the longest to convert because their text is extracted with PyMuPDF.
  </Step>
</Steps>

## Add a URL

URL sources are crawled at ingestion time. By default only the exact URL is fetched; enable crawling to follow links up to a depth and page limit. See [Crawling URLs](/knowledge-bases/crawling-urls) for every crawl option.

<Steps>
  <Step title="Add source → URL">
    Paste the URL.
  </Step>

  <Step title="Configure crawl options (optional)">
    Enable crawling to follow links up to a max depth and a max page count, with include/exclude path filters.
  </Step>

  <Step title="Wait for processing">
    Crawling is asynchronous — the source stays in progress until the crawl completes.
  </Step>
</Steps>
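
To reason about how depth, page count, and path filters combine, it can help to model the decision for a single discovered link. This is a minimal sketch, not the crawler's actual logic. The field names, prefix-style matching, and the rule that excludes win over includes are all assumptions made for illustration; [Crawling URLs](/knowledge-bases/crawling-urls) describes the real options.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlOptions:
    # Hypothetical field names that mirror the options described above.
    max_depth: int = 1
    max_pages: int = 50
    include_paths: list = field(default_factory=list)
    exclude_paths: list = field(default_factory=list)


def should_follow(url, depth, pages_fetched, opts):
    """Decide whether a discovered link falls within the crawl limits."""
    if depth > opts.max_depth or pages_fetched >= opts.max_pages:
        return False
    path = urlparse(url).path or "/"
    # Assumption: filters are path prefixes, and excludes win over includes.
    if any(path.startswith(p) for p in opts.exclude_paths):
        return False
    if opts.include_paths:
        return any(path.startswith(p) for p in opts.include_paths)
    return True


opts = CrawlOptions(
    max_depth=2,
    max_pages=10,
    include_paths=["/help"],
    exclude_paths=["/help/archive"],
)
print(should_follow("https://example.com/help/article/reset", 1, 0, opts))
print(should_follow("https://example.com/help/archive/old", 1, 0, opts))
```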

### Pattern-based URL ingestion

For sites with predictable URL structures (e.g. `/help/article/{slug}`), use **Add source → URL pattern** to add many pages at once. See [Knowledge Bases API usage](/knowledge-bases/api-usage).
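
To preview which pages a pattern such as `/help/article/{slug}` would cover, you can expand it locally against a list of known slugs. This sketch only illustrates the idea of a URL template. How the product resolves patterns is not described here, and `expand_pattern`, the base URL, and the slugs are all hypothetical.

```python
from urllib.parse import quote, urljoin


def expand_pattern(base_url, pattern, slugs):
    """Fill the {slug} placeholder for each slug and return absolute URLs."""
    urls = []
    for slug in slugs:
        # Percent-encode the slug so spaces or slashes can't alter the path.
        path = pattern.replace("{slug}", quote(slug, safe=""))
        urls.append(urljoin(base_url, path))
    return urls


urls = expand_pattern(
    "https://example.com",
    "/help/article/{slug}",
    ["reset-password", "billing faq"],
)
for url in urls:
    print(url)
```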

## Source status

After you add a source, it is in one of these states:

| Status                   | Meaning                                          |
| ------------------------ | ------------------------------------------------ |
| `pending`                | Queued for processing                            |
| `converting_to_markdown` | Extracting text (PDF) or converting crawled HTML |
| `chunking`               | Splitting into retrievable units                 |
| `embedding`              | Computing vectors                                |
| `ready`                  | Queryable                                        |
| `failed`                 | Processing failed; inspect the error and retry   |
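
If you track sources programmatically, the states above reduce to a small polling loop: keep checking until the source reaches `ready` or `failed`. The sketch below assumes a caller-supplied `get_status` function (hypothetical, for example a thin wrapper around whatever API call returns a source's status) and treats `failed` as reachable from any stage.

```python
import time

# Order taken from the status table above.
PIPELINE = ["pending", "converting_to_markdown", "chunking", "embedding", "ready"]
# Assumption: `failed` can be reached from any stage and is final.
TERMINAL = {"ready", "failed"}


def wait_until_done(get_status, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll get_status() until the source is ready or failed.

    get_status is a caller-supplied (hypothetical) function that returns
    the current status string for one source.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if status not in PIPELINE:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Because crawls and PDF conversion can take a while, a generous timeout and a modest polling interval are usually a safer default than tight loops.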

Failure is most often caused by:

* Scanned PDFs with no text layer (PyMuPDF can't OCR)
* URLs behind authentication
* File encoding issues for `.txt`
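
Encoding problems in `.txt` files can often be caught before upload. The sketch below checks whether a file's bytes decode cleanly. It assumes the pipeline expects UTF-8, which is not stated on this page, so treat the default encoding as a guess, and note that `find_decode_error` is an illustrative helper.

```python
def find_decode_error(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else describe the first bad byte."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        bad = data[exc.start]
        return f"byte 0x{bad:02x} at offset {exc.start} is not valid {encoding}"
    return None


# In practice, read the bytes with pathlib.Path("notes.txt").read_bytes().
print(find_decode_error(b"caf\xc3\xa9"))  # UTF-8 encoded text
print(find_decode_error(b"caf\xe9"))      # Latin-1 encoded text
```

If a file fails the check, re-saving it as UTF-8 in a text editor before uploading is one way to rule out encoding as the cause.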

## Refreshing content

Source content is **frozen at ingestion**. Updating a page or replacing a file doesn't update the knowledge base, and there is no in-place "refresh."

To refresh content, **delete the source and add it again**. For frequently changing sites, re-run the crawl by re-adding the URL (or by re-running the URL-pattern attach). See [Managing content](/knowledge-bases/managing-content).

## Best practices

* **One topic per KB.** Don't mix product docs and HR policies in one KB.
* **Smaller is better.** A KB of 50 high-quality chunks usually retrieves more accurately than one with 5,000 noisy chunks.
* **Strip navigation chrome from web sources.** Headers, footers, and sidebars pollute retrieval. If you control the source pages, render a clean version.
* **Markdown is gold.** If you can convert content to markdown ahead of time (e.g. from Notion or Confluence exports), retrieval quality is noticeably better than from PDFs.
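
If you export pages yourself, one way to drop navigation chrome is to discard the text inside common layout elements and keep the rest. This is a minimal sketch using only the Python standard library. The set of tags treated as chrome is an assumption, and the output is plain text that you could save as `.txt` or `.md` and upload as a file source.

```python
from html.parser import HTMLParser

# Assumption: these elements hold navigation chrome or non-content markup.
CHROME_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ChromeStripper(HTMLParser):
    """Collect text that sits outside any chrome element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when no chrome element is open.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def strip_chrome(html):
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<header>Logo | Home | Pricing</header>"
    "<main><h1>Reset your password</h1><p>Click the emailed link.</p></main>"
    "<footer>Copyright</footer>"
)
print(strip_chrome(page))
```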
