A URL source crawls a website and ingests its page content into a knowledge base for retrieval. You give Anyreach a starting URL and crawl options; the platform follows links, extracts the main text of each page as Markdown, and stores it as searchable sources.

Crawling is asynchronous. Creating a URL source kicks off a background crawl job and returns immediately. The crawl completes later, when the crawl provider calls back via webhook.

Create a URL source

Send a URL source to the sources endpoint. Each source carries an optional url_crawl_options object that controls the crawl.
curl -X POST https://api.anyreach.ai/knowledge-base/sources \
  -H "Authorization: Bearer <token>" \
  -H "X-Anyreach-Org: <organization_id>" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "type": "URL",
      "name": "https://example.com/docs",
      "url_crawl_options": {
        "limit": 10,
        "max_depth": 3,
        "add_to_datasets": ["<dataset_id>"]
      }
    }
  ]'
The name must be a valid URL when type is URL; the request fails validation otherwise. The domain is derived automatically from the URL. Creating sources requires the sources:manage scope.
The endpoint accepts a list, so you can submit several URLs in one request. Each URL source starts its own crawl job.
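
As a rough illustration of the list body, the sketch below builds one URL source per starting URL and runs a client-side URL check first. The helper names (`is_valid_url`, `build_url_sources`) are hypothetical, and the check is only an approximation: the server's own validation rules may be stricter.

```python
import json
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Approximate client-side check: http(s) scheme and a host are present."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_url_sources(urls, dataset_ids, limit=10, max_depth=3):
    """Build the JSON list body: one URL source per starting URL."""
    bad = [u for u in urls if not is_valid_url(u)]
    if bad:
        raise ValueError(f"not valid URLs: {bad}")
    return [
        {
            "type": "URL",
            "name": url,  # the name must be the URL itself
            "url_crawl_options": {
                "limit": limit,
                "max_depth": max_depth,
                "add_to_datasets": list(dataset_ids),
            },
        }
        for url in urls
    ]


body = build_url_sources(
    ["https://example.com/docs", "https://example.com/blog"],
    dataset_ids=["<dataset_id>"],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body with the same headers as the curl example. Each entry would start its own crawl job.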

Crawl lifecycle

POST /knowledge-base/sources (type=URL)
        │
        ▼
 parent source created (file_upload_status=PENDING)
        │
        ▼
 crawl job submitted to provider ──► returns immediately
        │
        ▼  (async)
 provider crawls pages, calls webhook per page + on completion
        │
        ▼
 one child source per discovered page (PENDING → COMPLETE)
 child sources auto-attached to add_to_datasets
Because the crawl runs in the background, a freshly created URL source has no content yet. Page sources appear and flip to COMPLETE as the provider reports each crawled page through the webhook, and the crawl job itself is marked COMPLETE when the provider finishes. If the crawl fails outright, the source is marked FAILED. The file_upload_status field on a source uses these values:
| Status | Meaning |
| --- | --- |
| PENDING | Source created; content not yet ingested. |
| IN_PROGRESS | Content is being processed. |
| COMPLETE | Content ingested and ready for retrieval. |
| FAILED | Crawl or processing failed. |
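
One way to reason about these statuses on the client is to tally a snapshot of sources and check whether any are still in flight. The sketch below is an assumption-laden illustration: the `snapshot` list is made-up sample data, and how you obtain the list of sources is not covered on this page. It only describes the sources you pass in, so pages the provider has not reported yet are not counted.

```python
from collections import Counter

# Statuses after which a source will not change again.
TERMINAL = {"COMPLETE", "FAILED"}


def summarize(sources):
    """Count sources per file_upload_status value."""
    return Counter(s["file_upload_status"] for s in sources)


def all_settled(sources):
    """True when the snapshot is non-empty and nothing is PENDING or IN_PROGRESS."""
    return bool(sources) and all(
        s["file_upload_status"] in TERMINAL for s in sources
    )


# Hypothetical snapshot of a crawl that is still running.
snapshot = [
    {"name": "https://example.com/docs", "file_upload_status": "COMPLETE"},
    {"name": "https://example.com/docs/setup", "file_upload_status": "IN_PROGRESS"},
    {"name": "https://example.com/docs/api", "file_upload_status": "PENDING"},
]

print(dict(summarize(snapshot)))
print(all_settled(snapshot))
```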

Crawl options

Set these fields under url_crawl_options. All are optional.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 10 | Maximum number of pages to crawl. |
| max_depth | integer | 3 | Maximum link depth to follow from the starting URL. |
| max_discovery_depth | integer | 3 | Maximum depth used while discovering links to crawl. |
| include_paths | string[] | none | URL path patterns to include in the crawl. |
| exclude_paths | string[] | none | URL path patterns to exclude from the crawl. |
| add_to_datasets | string[] | none | Dataset IDs to auto-attach the crawled page sources to. |
| crawler_provider | enum | firecrawl | Crawler to use: firecrawl or crawl4ai. |
Set add_to_datasets at crawl time so every page discovered by the crawl is attached to your dataset automatically. Otherwise you would have to attach each page source by hand afterward. See Managing content.
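
The options can be modeled client-side as a small typed object whose defaults mirror the documented ones. This is a minimal sketch, not an official SDK type: the `UrlCrawlOptions` class and its `to_payload` method are hypothetical, and `<path_pattern>` is a placeholder because the pattern syntax is not specified here.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

PROVIDERS = ("firecrawl", "crawl4ai")


@dataclass
class UrlCrawlOptions:
    """Client-side mirror of url_crawl_options with the documented defaults."""
    limit: int = 10
    max_depth: int = 3
    max_discovery_depth: int = 3
    include_paths: Optional[List[str]] = None
    exclude_paths: Optional[List[str]] = None
    add_to_datasets: Optional[List[str]] = None
    crawler_provider: str = "firecrawl"

    def __post_init__(self):
        # Reject values outside the documented enum before sending anything.
        if self.crawler_provider not in PROVIDERS:
            raise ValueError(f"crawler_provider must be one of {PROVIDERS}")

    def to_payload(self) -> dict:
        """Return a JSON-ready dict, leaving out list fields that were not set."""
        return {k: v for k, v in asdict(self).items() if v is not None}


options = UrlCrawlOptions(
    limit=50,
    exclude_paths=["<path_pattern>"],
    add_to_datasets=["<dataset_id>"],
)
print(options.to_payload())
```

Omitting unset list fields keeps the payload small and leaves those fields at their server-side defaults.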

Multi-page crawls

A crawl rarely produces a single page. As the provider reports crawled pages, the platform creates one child source per discovered page, named by that page’s URL, and stores its extracted Markdown. When the starting URL itself is a crawled page, the parent source holds that page’s content instead of a duplicate. Every child source is automatically attached to the datasets you listed in add_to_datasets, so the whole site becomes retrievable without further steps.
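
The parent and child rule described above can be modeled roughly as follows. This is an illustrative sketch of the behavior, not the platform's implementation: the dict shapes, the `apply_page` helper, and the exact-match URL comparison are all assumptions (the real system may normalize URLs before comparing).

```python
def apply_page(parent, children, page_url, markdown):
    """Record one crawled page, following the parent/child rule in the text."""
    if page_url == parent["name"]:
        # The starting URL was itself crawled: the parent holds that content,
        # so no duplicate child source is created.
        parent["content"] = markdown
        return parent
    child = {
        "name": page_url,  # child sources are named by the page URL
        "content": markdown,
        "file_upload_status": "COMPLETE",
        "datasets": list(parent["add_to_datasets"]),  # auto-attached
    }
    children.append(child)
    return child


parent = {
    "name": "https://example.com/docs",
    "content": None,
    "add_to_datasets": ["<dataset_id>"],
}
children = []

# Hypothetical per-page results, in the order a provider might report them.
apply_page(parent, children, "https://example.com/docs", "# Docs home")
apply_page(parent, children, "https://example.com/docs/setup", "# Setup")
apply_page(parent, children, "https://example.com/docs/api", "# API")

print(parent["content"])
print([c["name"] for c in children])
```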

Providers

Anyreach crawls with one of two providers. Both request Markdown of the main content only and strip out navigation, scripts, styles, images, and other non-content tags so the ingested text is clean.
| Provider | Value | Notes |
| --- | --- | --- |
| Firecrawl | firecrawl | Default. Hosted crawl service. |
| Crawl4AI | crawl4ai | Self-hosted crawler. |
The default provider is Firecrawl. If you do not set crawler_provider, the crawl is submitted to Firecrawl. When the fallback is enabled, a Firecrawl submission that fails falls back silently to the self-hosted Crawl4AI crawler — the source is still crawled, just by the other provider. Explicitly setting crawler_provider to crawl4ai sends the crawl straight to Crawl4AI with no Firecrawl attempt.
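
The selection and fallback rules in this paragraph can be sketched as a small function. Everything here is hypothetical scaffolding for illustration: `submit_crawl`, `SubmitError`, and the `submitters` mapping of provider name to callable are not part of the Anyreach API, and the failing Firecrawl submitter only simulates an outage.

```python
class SubmitError(Exception):
    """Raised by a (hypothetical) submitter when a provider rejects a job."""


def submit_crawl(url, submitters, crawler_provider="firecrawl",
                 fallback_enabled=False):
    """Return the name of the provider that accepted the crawl job."""
    if crawler_provider == "crawl4ai":
        # Explicit choice: go straight to Crawl4AI, no Firecrawl attempt.
        submitters["crawl4ai"](url)
        return "crawl4ai"
    try:
        submitters["firecrawl"](url)
        return "firecrawl"
    except SubmitError:
        if not fallback_enabled:
            raise
        # Fallback path: the source is still crawled, by the other provider.
        submitters["crawl4ai"](url)
        return "crawl4ai"


calls = []


def failing_firecrawl(url):
    calls.append("firecrawl")
    raise SubmitError("simulated submission failure")


def ok_crawl4ai(url):
    calls.append("crawl4ai")


submitters = {"firecrawl": failing_firecrawl, "crawl4ai": ok_crawl4ai}
used = submit_crawl("https://example.com/docs", submitters,
                    fallback_enabled=True)
print(used, calls)
```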
Both providers exclude tags such as nav, header, footer, script, style, iframe, form, and media tags (img, video, audio, svg) so only readable page content is ingested.
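
To make the effect of tag exclusion concrete, here is a minimal sketch using Python's standard `html.parser`. It is not how either provider works internally, and it produces plain text rather than Markdown; the `ContentText` class and `readable_text` helper are hypothetical, and the tag set is taken from the list above.

```python
from html.parser import HTMLParser

EXCLUDED = {"nav", "header", "footer", "script", "style", "iframe", "form",
            "img", "video", "audio", "svg"}
VOID = {"img"}  # void elements have no closing tag, so they never nest


class ContentText(HTMLParser):
    """Collect text that is not inside any excluded element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED and tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED and tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def readable_text(html: str) -> str:
    parser = ContentText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<nav>Home</nav><main><h1>Docs</h1>"
    '<p>Hello <img src="x.png">world</p>'
    "<script>var a = 1;</script></main><footer>Legal</footer>"
)
print(readable_text(page))
```

Only the heading and paragraph text survive; the navigation, script, and footer text is dropped.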

Crawling vs. the website demo overlay

Crawling and the website demo overlay both involve a website, but they are different features.
| | Crawling URLs | Website demo overlay |
| --- | --- | --- |
| Purpose | Ingest site content into a knowledge base | Preview an agent on top of a live site |
| Result | Page content stored and made retrievable | Visual overlay only |
| Ingestion | Yes: pages become searchable sources | No content is ingested |
Use crawling when you want an agent to answer from a website’s content. Use the demo overlay only to preview an agent experience on a page.

Uploading documents

Add files directly instead of crawling a URL.

Managing content

Organize sources and attach them to datasets.