Crawling URLs - Anyreach

A URL source crawls a website and ingests its page content into a knowledge base for retrieval. You give Anyreach a starting URL and crawl options; the platform follows links, extracts the main text of each page as Markdown, and stores each page as a searchable source. Crawling is asynchronous: creating a URL source starts a background crawl job and returns immediately. The crawl completes later, when the crawl provider calls back via webhook.

Create a URL source

POST a JSON list of sources to the sources endpoint. Each URL source carries an optional url_crawl_options object that controls the crawl.

curl -X POST https://api.anyreach.ai/knowledge-base/sources \
  -H "Authorization: Bearer <token>" \
  -H "X-Anyreach-Org: <organization_id>" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "type": "URL",
      "name": "https://example.com/docs",
      "url_crawl_options": {
        "limit": 10,
        "max_depth": 3,
        "add_to_datasets": ["<dataset_id>"]
      }
    }
  ]'

The name must be a valid URL when type is URL; the request fails validation otherwise. The domain is derived automatically from the URL. Creating sources requires the sources:manage scope.
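Because an invalid name fails validation, it can help to check URLs on the client before submitting. The sketch below is a minimal pre-check using Python's standard library. The function name validate_source_url is hypothetical, and the rule it applies (an absolute http or https URL with a host) is an assumption; the server's exact validation and domain derivation may differ.

```python
from urllib.parse import urlparse


def validate_source_url(name: str) -> str:
    """Return the host of `name`, or raise ValueError if it is not an absolute http(s) URL.

    Assumption: the API accepts absolute http/https URLs. This is a client-side
    pre-check only, not the platform's own validation logic.
    """
    parsed = urlparse(name)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not a valid URL: {name!r}")
    return parsed.hostname


print(validate_source_url("https://example.com/docs"))  # example.com
```

A value such as "example.com/docs" (no scheme) raises ValueError here, so it can be corrected before the request is sent.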

The endpoint accepts a list, so you can submit several URLs in one request. Each URL source starts its own crawl job.
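As an illustration of submitting several URLs at once, the following sketch builds the same request as the curl example with Python's standard library. The helper name build_request is hypothetical, the placeholder values must be replaced with real credentials, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.anyreach.ai/knowledge-base/sources"


def build_request(urls, token, organization_id, dataset_id):
    """Build one POST request that creates a URL source per entry in `urls`."""
    payload = [
        {
            "type": "URL",
            "name": url,
            "url_crawl_options": {"add_to_datasets": [dataset_id]},
        }
        for url in urls
    ]
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Anyreach-Org": organization_id,
            "Content-Type": "application/json",
        },
    )


request = build_request(
    ["https://example.com/docs", "https://example.com/blog"],
    token="<token>",
    organization_id="<organization_id>",
    dataset_id="<dataset_id>",
)
# Each list entry starts its own crawl job once the request is sent,
# for example with urllib.request.urlopen(request).
print(len(json.loads(request.data)))  # 2
```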

Crawl lifecycle

POST /knowledge-base/sources (type=URL)
        │
        ▼
 parent source created (file_upload_status=PENDING)
        │
        ▼
 crawl job submitted to provider ──► returns immediately
        │
        ▼  (async)
 provider crawls pages, calls webhook per page + on completion
        │
        ▼
 one child source per discovered page (PENDING → COMPLETE)
 child sources auto-attached to add_to_datasets

Because the crawl runs in the background, a freshly created URL source has no content yet. Page sources appear and flip to COMPLETE as the provider reports each crawled page through the webhook, and the crawl job itself is marked COMPLETE when the provider finishes. If the crawl fails outright, the source is marked FAILED. The file_upload_status field on a source uses these values:

Status	Meaning
`PENDING`	Source created; content not yet ingested.
`IN_PROGRESS`	Content is being processed.
`COMPLETE`	Content ingested and ready for retrieval.
`FAILED`	Crawl or processing failed.
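Since COMPLETE and FAILED are the two end states, a client can poll until one of them appears. The sketch below assumes you supply a fetch_status callable that returns the current file_upload_status string; how that value is retrieved is not covered on this page, so the callable, the function name wait_for_source, and the timing defaults are all hypothetical.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def wait_for_source(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `fetch_status` until the source reaches COMPLETE or FAILED."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)  # PENDING or IN_PROGRESS: wait and try again
    raise TimeoutError("source did not reach a terminal status in time")


# Usage with a fake status sequence instead of a real API call.
statuses = iter(["PENDING", "IN_PROGRESS", "COMPLETE"])
print(wait_for_source(lambda: next(statuses), sleep=lambda _: None))  # COMPLETE
```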

Crawl options

Set these fields under url_crawl_options. All are optional.

Field	Type	Default	Description
`limit`	integer	`10`	Maximum number of pages to crawl.
`max_depth`	integer	`3`	Maximum link depth to follow from the starting URL.
`max_discovery_depth`	integer	`3`	Maximum depth used while discovering links to crawl.
`include_paths`	string[]	none	URL path patterns to include in the crawl.
`exclude_paths`	string[]	none	URL path patterns to exclude from the crawl.
`add_to_datasets`	string[]	none	Dataset IDs to auto-attach the crawled page sources to.
`crawler_provider`	enum	`firecrawl`	Crawler to use: `firecrawl` or `crawl4ai`.

Set add_to_datasets at crawl time so every page discovered by the crawl is attached to your dataset automatically. Otherwise you would have to attach each page source by hand afterward. See Managing content.
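One way to keep these fields and defaults in one place on the client is a small dataclass, sketched below. The class name UrlCrawlOptions and the to_payload method are hypothetical; the field names, defaults, and provider values come from the table above. Unset list fields are omitted from the payload rather than sent as null, which is a design assumption.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class UrlCrawlOptions:
    """Client-side mirror of the url_crawl_options fields and their defaults."""

    limit: int = 10
    max_depth: int = 3
    max_discovery_depth: int = 3
    include_paths: Optional[List[str]] = None
    exclude_paths: Optional[List[str]] = None
    add_to_datasets: Optional[List[str]] = None
    crawler_provider: str = "firecrawl"

    def __post_init__(self) -> None:
        if self.crawler_provider not in ("firecrawl", "crawl4ai"):
            raise ValueError(f"unknown crawler_provider: {self.crawler_provider!r}")

    def to_payload(self) -> dict:
        """Return a JSON-ready dict, dropping fields that were left unset."""
        return {key: value for key, value in asdict(self).items() if value is not None}


options = UrlCrawlOptions(limit=25, add_to_datasets=["<dataset_id>"])
print(options.to_payload())
```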

Multi-page crawls

A crawl rarely produces a single page. As the provider reports crawled pages, the platform creates one child source per discovered page, named by that page’s URL, and stores its extracted Markdown. When the starting URL is itself one of the crawled pages, its content is stored on the parent source rather than on a duplicate child. Every child source is automatically attached to the datasets you listed in add_to_datasets, so the whole site becomes retrievable without further steps.
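The parent and child behavior can be modeled in a few lines. This is only an illustration of the rules described above, not the platform's implementation: the function name assign_pages and the dict shapes are hypothetical, and matching the starting URL by exact string comparison is an assumption.

```python
from typing import Dict, List, Tuple


def assign_pages(
    start_url: str,
    crawled_pages: Dict[str, str],
    add_to_datasets: List[str],
) -> Tuple[dict, List[dict]]:
    """Split crawled pages into the parent source and its child sources."""
    parent = {"name": start_url, "content": None}
    children = []
    for url, markdown in crawled_pages.items():
        if url == start_url:
            # The starting page's content lives on the parent, not on a child.
            parent["content"] = markdown
        else:
            children.append(
                {"name": url, "content": markdown, "datasets": list(add_to_datasets)}
            )
    return parent, children


pages = {
    "https://example.com/docs": "# Docs home",
    "https://example.com/docs/setup": "# Setup",
}
parent, children = assign_pages("https://example.com/docs", pages, ["<dataset_id>"])
print(parent["content"])               # "# Docs home"
print([c["name"] for c in children])   # ['https://example.com/docs/setup']
```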

Providers

Anyreach crawls with one of two providers. Both request Markdown of the main content only and strip out navigation, scripts, styles, images, and other non-content tags so the ingested text is clean.

Provider	Value	Notes
Firecrawl	`firecrawl`	Default. Hosted crawl service.
Crawl4AI	`crawl4ai`	Self-hosted crawler.

Firecrawl is the default: if you do not set crawler_provider, the crawl is submitted to Firecrawl. When the Crawl4AI fallback is enabled, a Firecrawl submission that fails falls back silently to the self-hosted Crawl4AI crawler, so the source is still crawled, just by the other provider. Explicitly setting crawler_provider to crawl4ai sends the crawl straight to Crawl4AI with no Firecrawl attempt.
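The selection and fallback rules can be summarized as a short decision function. This is a sketch of the described behavior under stated assumptions: submit_crawl, the fallback_enabled flag, and the two submit callables are hypothetical stand-ins, and how the fallback is actually enabled is not covered here.

```python
from typing import Callable, Optional, Tuple


def submit_crawl(
    url: str,
    crawler_provider: Optional[str] = None,
    fallback_enabled: bool = True,
    *,
    submit_firecrawl: Callable[[str], str],
    submit_crawl4ai: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (provider_used, job_id) following the documented routing rules."""
    if crawler_provider == "crawl4ai":
        # Explicit choice: go straight to Crawl4AI, no Firecrawl attempt.
        return "crawl4ai", submit_crawl4ai(url)
    try:
        return "firecrawl", submit_firecrawl(url)
    except Exception:
        if not fallback_enabled:
            raise
        # Silent fallback: the crawl still runs, on the other provider.
        return "crawl4ai", submit_crawl4ai(url)


def failing_firecrawl(url: str) -> str:
    raise RuntimeError("firecrawl submission failed")


def working_crawl4ai(url: str) -> str:
    return f"job-for-{url}"


print(submit_crawl(
    "https://example.com",
    submit_firecrawl=failing_firecrawl,
    submit_crawl4ai=working_crawl4ai,
))  # ('crawl4ai', 'job-for-https://example.com')
```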

Both providers exclude tags such as nav, header, footer, script, style, iframe, form, and media tags (img, video, audio, svg) so only readable page content is ingested.
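To show the effect of this filtering, the sketch below drops text inside the excluded tags using Python's built-in HTML parser. It is not how either provider works internally (they return Markdown, and their full tag lists may be longer); the class name ContentExtractor is hypothetical and the output is plain text only.

```python
from html.parser import HTMLParser

# Container tags whose text content is dropped. img is also excluded by the
# providers, but it is a void element with no text, so it needs no handling here.
EXCLUDED_TAGS = {
    "nav", "header", "footer", "script", "style",
    "iframe", "form", "video", "audio", "svg",
}


class ContentExtractor(HTMLParser):
    """Collect text that is not nested inside any excluded tag."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


html = (
    "<html><body><nav>Home | Docs</nav>"
    "<main><h1>Guide</h1><p>Install the SDK.</p><img src='a.png'></main>"
    "<script>track()</script><footer>Copyright Example</footer></body></html>"
)
extractor = ContentExtractor()
extractor.feed(html)
extractor.close()
print(extractor.chunks)  # ['Guide', 'Install the SDK.']
```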

Crawling vs. the website demo overlay

Crawling and the website demo overlay both involve a website, but they are different features.

	Crawling URLs	Website demo overlay
Purpose	Ingest site content into a knowledge base	Preview an agent on top of a live site
Result	Page content stored and made retrievable	Visual overlay only
Ingestion	Yes — pages become searchable sources	No content is ingested

Use crawling when you want an agent to answer from a website’s content. Use the demo overlay only to preview an agent experience on a page.

Related

Uploading documents: Add files directly instead of crawling a URL.

Managing content: Organize sources and attach them to datasets.
