# Crawling URLs

> Ingest website content into a knowledge base by crawling URLs.

A URL source crawls a website and ingests its page content into a knowledge base for retrieval. You give Anyreach a starting URL and crawl options; the platform follows links, extracts the main text of each page as Markdown, and stores it as searchable sources.

Crawling is asynchronous. Creating a URL source starts a background crawl job and returns immediately; the crawl completes later, when the crawl provider calls back via webhook.

## Create a URL source

Send a `URL` source to the sources endpoint. Each source carries an optional `url_crawl_options` object that controls the crawl.

```bash
curl -X POST https://api.anyreach.ai/knowledge-base/sources \
  -H "Authorization: Bearer <token>" \
  -H "X-Anyreach-Org: <organization_id>" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "type": "URL",
      "name": "https://example.com/docs",
      "url_crawl_options": {
        "limit": 10,
        "max_depth": 3,
        "add_to_datasets": ["<dataset_id>"]
      }
    }
  ]'
```

The `name` must be a valid URL when `type` is `URL`; the request fails validation otherwise. The `domain` is derived automatically from the URL. Creating sources requires the `sources:manage` scope.

<Note>
  The endpoint accepts a list, so you can submit several URLs in one request. Each `URL` source starts its own crawl job.
</Note>
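
As a rough illustration of submitting several URLs at once, the Python sketch below builds the same request body as the curl example, one source object per starting URL. The helper names (`is_valid_url`, `build_url_sources`, `build_request`) are hypothetical, and the local URL check is only a pre-flight guess at what the API validates; the server's own rules may differ.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Endpoint taken from the curl example above.
API_URL = "https://api.anyreach.ai/knowledge-base/sources"


def is_valid_url(value):
    """Local pre-check only (assumption): require an http(s) scheme and a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_url_sources(urls, dataset_ids, limit=10, max_depth=3):
    """Return one URL source object per starting URL."""
    bad = [u for u in urls if not is_valid_url(u)]
    if bad:
        raise ValueError(f"not valid URLs: {bad}")
    return [
        {
            "type": "URL",
            "name": url,
            "url_crawl_options": {
                "limit": limit,
                "max_depth": max_depth,
                "add_to_datasets": list(dataset_ids),
            },
        }
        for url in urls
    ]


def build_request(sources, token, organization_id):
    """Prepare (but do not send) the POST request with the documented headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(sources).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "X-Anyreach-Org": organization_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; because each `URL` source starts its own crawl job, a list of two URLs results in two crawls.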

## Crawl lifecycle

```
POST /knowledge-base/sources (type=URL)
        │
        ▼
 parent source created (file_upload_status=PENDING)
        │
        ▼
 crawl job submitted to provider ──► returns immediately
        │
        ▼  (async)
 provider crawls pages, calls webhook per page + on completion
        │
        ▼
 one child source per discovered page (PENDING → COMPLETE)
 child sources auto-attached to add_to_datasets
```

Because the crawl runs in the background, a freshly created URL source has no content yet. Page sources appear and flip to `COMPLETE` as the provider reports each crawled page through the webhook, and the crawl job itself is marked `COMPLETE` when the provider finishes. If the crawl fails outright, the source is marked `FAILED`.

The `file_upload_status` field on a source uses these values:

| Status        | Meaning                                   |
| ------------- | ----------------------------------------- |
| `PENDING`     | Source created; content not yet ingested. |
| `IN_PROGRESS` | Content is being processed.               |
| `COMPLETE`    | Content ingested and ready for retrieval. |
| `FAILED`      | Crawl or processing failed.               |
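
To make the status values concrete, here is a minimal Python sketch that summarizes a crawl from a list of source records. It assumes you have already retrieved the sources as dictionaries carrying a `file_upload_status` field; how you retrieve them is not covered here. The function name `summarize_crawl` and the choice to treat `COMPLETE` and `FAILED` as terminal are illustrative assumptions based on the table above.

```python
from collections import Counter

# Assumption: these two statuses mean no further processing will happen.
TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def summarize_crawl(sources):
    """Count sources per file_upload_status and report whether all are terminal."""
    counts = Counter(s["file_upload_status"] for s in sources)
    finished = bool(sources) and all(
        s["file_upload_status"] in TERMINAL_STATUSES for s in sources
    )
    return {"counts": dict(counts), "finished": finished}
```

Because page sources keep appearing while the provider is still crawling, `finished` only describes the sources known at the time of the call.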

## Crawl options

Set these fields under `url_crawl_options`. All are optional.

| Field                 | Type      | Default     | Description                                             |
| --------------------- | --------- | ----------- | ------------------------------------------------------- |
| `limit`               | integer   | `10`        | Maximum number of pages to crawl.                       |
| `max_depth`           | integer   | `3`         | Maximum link depth to follow from the starting URL.     |
| `max_discovery_depth` | integer   | `3`         | Maximum depth used while discovering links to crawl.    |
| `include_paths`       | string\[] | none        | URL path patterns to include in the crawl.              |
| `exclude_paths`       | string\[] | none        | URL path patterns to exclude from the crawl.            |
| `add_to_datasets`     | string\[] | none        | Dataset IDs to auto-attach the crawled page sources to. |
| `crawler_provider`    | enum      | `firecrawl` | Crawler to use: `firecrawl` or `crawl4ai`.              |

<Tip>
  Set `add_to_datasets` at crawl time so every page discovered by the crawl is attached to your dataset automatically. Otherwise you would have to attach each page source by hand afterward. See [Managing content](/knowledge-bases/managing-content).
</Tip>
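
One way to keep the options and their defaults in one place on the client side is a small dataclass, sketched below. The class name `UrlCrawlOptions` and the `to_payload` method are hypothetical; the field names and defaults mirror the table above. The syntax of path patterns is not specified on this page, so `<path_pattern>` is a placeholder.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class UrlCrawlOptions:
    """Client-side mirror of url_crawl_options (defaults copied from the table)."""

    limit: int = 10
    max_depth: int = 3
    max_discovery_depth: int = 3
    include_paths: Optional[List[str]] = None
    exclude_paths: Optional[List[str]] = None
    add_to_datasets: Optional[List[str]] = None
    crawler_provider: str = "firecrawl"

    def __post_init__(self):
        # Only the two documented provider values are accepted.
        if self.crawler_provider not in ("firecrawl", "crawl4ai"):
            raise ValueError(f"unknown crawler_provider: {self.crawler_provider}")

    def to_payload(self):
        """Return a dict for url_crawl_options, omitting fields left as None."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Example: crawl up to 50 pages and attach every page to one dataset.
options = UrlCrawlOptions(
    limit=50,
    include_paths=["<path_pattern>"],
    add_to_datasets=["<dataset_id>"],
)
payload = options.to_payload()
```

Omitting unset list fields keeps the request small and leaves those options at their server-side defaults.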

## Multi-page crawls

A crawl rarely produces a single page. As the provider reports crawled pages, the platform creates **one child source per discovered page**, named by that page's URL, and stores its extracted Markdown. When the starting URL is itself one of the crawled pages, the parent source stores that page's content, so no duplicate child source is created for it.

Every child source is automatically attached to the datasets you listed in `add_to_datasets`, so the whole site becomes retrievable without further steps.
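
The parent/child relationship described above can be modeled in a few lines of Python. This is only a sketch of the documented behavior, not the platform's implementation: the function name `plan_sources` is hypothetical, and matching the starting URL by exact string comparison is an assumption (the platform may normalize URLs differently).

```python
def plan_sources(start_url, crawled_urls):
    """Model which source ends up holding each crawled page."""
    children = []
    parent_has_content = False
    for url in crawled_urls:
        if url == start_url:
            # The parent stores the starting page itself; no duplicate child.
            parent_has_content = True
        else:
            # Every other discovered page becomes a child named by its URL.
            children.append(url)
    return {
        "parent": start_url,
        "parent_has_content": parent_has_content,
        "children": children,
    }
```

For a crawl of `https://example.com/docs` that reports the starting page plus two subpages, this model yields one parent with content and two child sources.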

## Providers

Anyreach crawls with one of two providers. Both request **Markdown of the main content only** and strip out navigation, scripts, styles, images, and other non-content tags so the ingested text is clean.

| Provider  | Value       | Notes                          |
| --------- | ----------- | ------------------------------ |
| Firecrawl | `firecrawl` | Default. Hosted crawl service. |
| Crawl4AI  | `crawl4ai`  | Self-hosted crawler.           |

The default provider is **Firecrawl**. If you do not set `crawler_provider`, the crawl is submitted to Firecrawl.

When fallback to Crawl4AI is enabled, a failed Firecrawl submission falls back **silently** to the self-hosted Crawl4AI crawler: the source is still crawled, just by the other provider. Explicitly setting `crawler_provider` to `crawl4ai` sends the crawl straight to Crawl4AI with no Firecrawl attempt.
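
The provider selection and fallback rules can be sketched as follows. This is a model of the behavior described here, not Anyreach's actual code: `submit_crawl`, `submit_firecrawl`, `submit_crawl4ai`, and `fallback_enabled` are hypothetical names, and re-raising the error when the fallback is disabled is an assumption, since this page does not say what happens in that case.

```python
def submit_crawl(url, submit_firecrawl, submit_crawl4ai,
                 crawler_provider="firecrawl", fallback_enabled=False):
    """Return (provider_used, job) following the documented selection rules."""
    if crawler_provider == "crawl4ai":
        # Explicit choice: go straight to Crawl4AI, no Firecrawl attempt.
        return "crawl4ai", submit_crawl4ai(url)
    try:
        return "firecrawl", submit_firecrawl(url)
    except Exception:
        if not fallback_enabled:
            raise  # assumption: without fallback, the failure surfaces
        # Silent fallback: the crawl still runs, on the other provider.
        return "crawl4ai", submit_crawl4ai(url)
```

Returning the provider that actually ran makes the otherwise silent fallback observable to the caller, which is useful when debugging differences in extracted content.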

<Note>
  Both providers exclude tags such as `nav`, `header`, `footer`, `script`, `style`, `iframe`, `form`, and media tags (`img`, `video`, `audio`, `svg`) so only readable page content is ingested.
</Note>
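
To show what excluding these tags means in practice, the sketch below drops the text inside excluded elements using Python's standard `html.parser`. It is an illustration only: the real providers return Markdown and use their own extraction logic, and the names `MainTextExtractor` and `extract_main_text` are hypothetical. The tag set is copied from the note above, which lists examples rather than a complete set.

```python
from html.parser import HTMLParser

EXCLUDED_TAGS = {
    "nav", "header", "footer", "script", "style", "iframe", "form",
    "img", "video", "audio", "svg",
}
# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"img"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside an excluded element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS and tag not in VOID_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS and tag not in VOID_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while outside every excluded element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Given a page with a navigation bar, a heading, a paragraph, and a footer, only the heading and paragraph text survive.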

## Crawling vs. the website demo overlay

Crawling and the website demo overlay both involve a website, but they are different features.

|           | Crawling URLs                             | [Website demo overlay](/agents/website-demo-overlay) |
| --------- | ----------------------------------------- | ---------------------------------------------------- |
| Purpose   | Ingest site content into a knowledge base | Preview an agent on top of a live site               |
| Result    | Page content stored and made retrievable  | Visual overlay only                                  |
| Ingestion | Yes — pages become searchable sources     | No content is ingested                               |

Use crawling when you want an agent to **answer from** a website's content. Use the demo overlay only to **preview** an agent experience on a page.

## Related

<CardGroup cols={2}>
  <Card title="Uploading documents" icon="file-arrow-up" href="/knowledge-bases/uploading-documents">
    Add files directly instead of crawling a URL.
  </Card>

  <Card title="Managing content" icon="folder-tree" href="/knowledge-bases/managing-content">
    Organize sources and attach them to datasets.
  </Card>
</CardGroup>
