Create a URL source
Send a URL source to the sources endpoint. Each source carries an optional url_crawl_options object that controls the crawl.
name must be a valid URL when type is URL; the request fails validation otherwise. The domain is derived automatically from the URL. Creating sources requires the sources:manage scope.
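As a minimal sketch of the request body, the helper below builds one source object per URL and runs a rough client-side URL check before sending. The field names type, name, and url_crawl_options come from this page; the helper names, the bare-list body shape, and the http(s)-only check are assumptions for illustration, and the server's own validation may be stricter or looser.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Rough client-side check (assumption): http(s) scheme plus a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_url_sources(urls, crawl_options=None):
    """Return a list of URL source objects, one per URL.

    The exact body shape expected by the sources endpoint is assumed here.
    """
    sources = []
    for url in urls:
        if not is_valid_url(url):
            raise ValueError(f"not a valid URL: {url!r}")
        source = {"type": "URL", "name": url}
        if crawl_options:
            # Copy so each source carries its own options object.
            source["url_crawl_options"] = dict(crawl_options)
        sources.append(source)
    return sources


if __name__ == "__main__":
    body = build_url_sources(
        ["https://example.com/docs", "https://example.com/blog"],
        {"limit": 5},
    )
    print(body)
```

Send the resulting list with a credential that has the sources:manage scope.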
The endpoint accepts a list, so you can submit several URLs in one request. Each URL source starts its own crawl job.
Crawl lifecycle
Page sources are marked COMPLETE as the provider reports each crawled page through the webhook, and the crawl job itself is marked COMPLETE when the provider finishes. If the crawl fails outright, the source is marked FAILED.
The file_upload_status field on a source uses these values:
| Status | Meaning |
|---|---|
| PENDING | Source created; content not yet ingested. |
| IN_PROGRESS | Content is being processed. |
| COMPLETE | Content ingested and ready for retrieval. |
| FAILED | Crawl or processing failed. |
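A client that checks on sources can model these values as an enum. The sketch below is illustrative only: the class and function names are hypothetical, and treating COMPLETE and FAILED as final states is an assumption, since this page does not say whether a failed source can be retried.

```python
from enum import Enum


class FileUploadStatus(str, Enum):
    """Values of the file_upload_status field, as listed in the table."""
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Assumption: these two states do not change again.
TERMINAL = {FileUploadStatus.COMPLETE, FileUploadStatus.FAILED}


def is_terminal(status: str) -> bool:
    """True when a source no longer needs to be checked."""
    return FileUploadStatus(status) in TERMINAL


def summarize(sources):
    """Count sources per status; each source is a dict with file_upload_status."""
    counts = {status.value: 0 for status in FileUploadStatus}
    for source in sources:
        counts[FileUploadStatus(source["file_upload_status"]).value] += 1
    return counts
```

An unknown status string raises ValueError, which surfaces API changes early instead of silently miscounting.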
Crawl options
Set these fields under url_crawl_options. All are optional.
| Field | Type | Default | Description |
|---|---|---|---|
| limit | integer | 10 | Maximum number of pages to crawl. |
| max_depth | integer | 3 | Maximum link depth to follow from the starting URL. |
| max_discovery_depth | integer | 3 | Maximum depth used while discovering links to crawl. |
| include_paths | string[] | none | URL path patterns to include in the crawl. |
| exclude_paths | string[] | none | URL path patterns to exclude from the crawl. |
| add_to_datasets | string[] | none | Dataset IDs to auto-attach the crawled page sources to. |
| crawler_provider | enum | firecrawl | Crawler to use: firecrawl or crawl4ai. |
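The table can be mirrored on the client as a small typed object. The sketch below assumes the field names and defaults shown above; the class name, the to_payload helper, and the example path pattern are hypothetical, and the pattern syntax that include_paths and exclude_paths accept is not specified on this page.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional

ALLOWED_PROVIDERS = ("firecrawl", "crawl4ai")


@dataclass
class UrlCrawlOptions:
    """Client-side mirror of url_crawl_options with the documented defaults."""
    limit: int = 10
    max_depth: int = 3
    max_discovery_depth: int = 3
    include_paths: Optional[List[str]] = None
    exclude_paths: Optional[List[str]] = None
    add_to_datasets: Optional[List[str]] = None
    crawler_provider: str = "firecrawl"

    def __post_init__(self):
        # Reject values outside the documented enum before sending.
        if self.crawler_provider not in ALLOWED_PROVIDERS:
            raise ValueError(f"unknown crawler_provider: {self.crawler_provider!r}")

    def to_payload(self) -> dict:
        """Serialize to a dict, omitting list fields that were never set."""
        return {key: value for key, value in asdict(self).items() if value is not None}


if __name__ == "__main__":
    # "/docs/*" is an illustrative pattern only.
    options = UrlCrawlOptions(limit=25, include_paths=["/docs/*"])
    print(options.to_payload())
```

Because every field is optional, sending an empty object or omitting url_crawl_options should be equivalent to the defaults listed in the table.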
Multi-page crawls
A crawl rarely produces a single page. As the provider reports crawled pages, the platform creates one child source per discovered page, named by that page’s URL, and stores its extracted Markdown. When the starting URL itself is a crawled page, the parent source holds that page’s content instead of a duplicate. Every child source is automatically attached to the datasets you listed in add_to_datasets, so the whole site becomes retrievable without further steps.
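The parent and child behavior described above can be pictured with a small simulation. This is not the platform's implementation: the function name and dict shapes are hypothetical, and matching the starting URL by exact string comparison is an assumption (the real service may normalize URLs first).

```python
def sources_from_crawl(parent_url, pages, add_to_datasets=()):
    """Split crawled pages into a parent source and child sources.

    pages maps each crawled page URL to its extracted Markdown.
    """
    parent = {"name": parent_url, "content": None}
    children = []
    for url, markdown in pages.items():
        if url == parent_url:
            # The starting URL's content lives on the parent, not a duplicate child.
            parent["content"] = markdown
            continue
        children.append({
            "name": url,  # child sources are named by the page URL
            "content": markdown,
            "datasets": list(add_to_datasets),  # auto-attached datasets
        })
    return parent, children


if __name__ == "__main__":
    parent, children = sources_from_crawl(
        "https://example.com/",
        {
            "https://example.com/": "# Home",
            "https://example.com/pricing": "# Pricing",
        },
        add_to_datasets=["dataset-1"],  # hypothetical dataset ID
    )
    print(parent["name"], [child["name"] for child in children])
```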
Providers
Anyreach crawls with one of two providers. Both request Markdown of the main content only and strip out navigation, scripts, styles, images, and other non-content tags so the ingested text is clean.
| Provider | Value | Notes |
|---|---|---|
| Firecrawl | firecrawl | Default. Hosted crawl service. |
| Crawl4AI | crawl4ai | Self-hosted crawler. |
If you do not set crawler_provider, the crawl is submitted to Firecrawl.
When the fallback is enabled, a Firecrawl submission that fails falls back silently to the self-hosted Crawl4AI crawler — the source is still crawled, just by the other provider. Explicitly setting crawler_provider to crawl4ai sends the crawl straight to Crawl4AI with no Firecrawl attempt.
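The selection and fallback rules can be summarized in a few lines. The sketch below is a model of the described behavior, not the service's code: submit_crawl is a hypothetical name, and firecrawl and crawl4ai are caller-supplied stand-in functions rather than real client calls.

```python
def submit_crawl(url, provider=None, *, fallback_enabled, firecrawl, crawl4ai):
    """Submit a crawl and return the name of the provider that accepted it.

    firecrawl and crawl4ai are callables taking the URL; each raises on failure.
    """
    if provider == "crawl4ai":
        # Explicit choice: go straight to Crawl4AI, no Firecrawl attempt.
        crawl4ai(url)
        return "crawl4ai"
    try:
        # Default path, also used when provider is "firecrawl" or unset.
        firecrawl(url)
        return "firecrawl"
    except Exception:
        if not fallback_enabled:
            raise
        # Silent fallback: the source is still crawled, by the other provider.
        crawl4ai(url)
        return "crawl4ai"
```

Passing the providers in as arguments keeps the routing logic testable without network access.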
Both providers exclude tags such as nav, header, footer, script, style, iframe, form, and media tags (img, video, audio, svg) so only readable page content is ingested.
Crawling vs. the website demo overlay
Crawling and the website demo overlay both involve a website, but they are different features.
| | Crawling URLs | Website demo overlay |
|---|---|---|
| Purpose | Ingest site content into a knowledge base | Preview an agent on top of a live site |
| Result | Page content stored and made retrievable | Visual overlay only |
| Ingestion | Yes — pages become searchable sources | No content is ingested |
Related
Uploading documents
Add files directly instead of crawling a URL.
Managing content
Organize sources and attach them to datasets.

