CCBot

CCBot is the crawler operated by Common Crawl, a nonprofit that publishes open web-crawl datasets. These datasets have been used to train many major AI models, including GPT-3, LLaMA, Mistral, and others, and the corpus is one of the most widely used training sources in the AI industry.

Respects robots.txt

Yes

Can be blocked

Yes

Crawl-Delay support

Type

AI / LLM

Purpose

Builds an open web corpus used to train many AI models (GPT-3, LLaMA, Mistral)

SEO Impact

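Blocking CCBot keeps your pages out of future Common Crawl datasets, which reduces the chance of their being used to train models built on that corpus. It does not remove pages already captured in earlier crawls, and it does not stop proprietary models such as GPT-4 from obtaining your content through other crawlers (for example GPTBot), which must be blocked separately. CCBot is not a search engine crawler, so blocking it does not affect search rankings.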

User-Agent String

CCBot/2.0
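
To check whether CCBot is visiting a site, one option is to look for the "CCBot" token in the User-Agent header of logged requests. The snippet below is a minimal sketch, not an official detection method: the function name is_ccbot is hypothetical, and the longer sample string with a URL in parentheses is an assumption about how the header may appear in practice. User-Agent headers can be spoofed, so a match is only a hint.

```python
def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the CCBot token.

    Hypothetical helper: a case-insensitive substring match on "ccbot".
    """
    return "ccbot" in user_agent.lower()


if __name__ == "__main__":
    # The second sample is an assumed full header form, shown for illustration.
    samples = [
        "CCBot/2.0",
        "CCBot/2.0 (https://commoncrawl.org/faq/)",
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
    ]
    for ua in samples:
        print(ua, "->", is_ccbot(ua))
```

A substring match is used rather than an exact comparison so that the check keeps working if the version number changes.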

robots.txt Control

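To block CCBot from the whole site, add a "User-agent: CCBot" group containing "Disallow: /" to your robots.txt file. If no rule names CCBot or matches it through a wildcard group, crawling is allowed by default.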

Block

User-agent: CCBot
Disallow: /

Allow (default)

User-agent: CCBot
Allow: /
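
The two rule sets above can be checked locally before deploying them. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name can_ccbot_fetch and the example.com URL are illustrative assumptions, and the standard-library parser may differ in edge cases from how CCBot itself interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The "Block" and "Allow" rule sets from this page, as plain strings.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""

ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""


def can_ccbot_fetch(robots_txt: str, url: str) -> bool:
    """Parse robots.txt text and report whether CCBot may fetch the URL.

    Hypothetical helper built on the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot/2.0", url)


if __name__ == "__main__":
    url = "https://example.com/page"  # placeholder URL
    print("block rules:", can_ccbot_fetch(BLOCK_RULES, url))
    print("allow rules:", can_ccbot_fetch(ALLOW_RULES, url))
```

Parsing the rules from a string keeps the check offline; to test a live site instead, RobotFileParser also offers set_url() and read().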



Other AI / LLM Bots

GPTBot ChatGPT-User ClaudeBot PerplexityBot
