CCBot
AI / LLM by Common Crawl
CCBot is Common Crawl's crawler. It builds the open web datasets used to train many major AI models, including GPT-3, LLaMA, Mistral, and others. The Common Crawl corpus is one of the most widely used training sources in the AI industry.
Respects robots.txt
Yes
Can be blocked
Yes
Crawl-Delay support
Yes
Type
AI / LLM
Purpose
Builds an open web corpus used to train many AI models (GPT-3, LLaMA, Mistral)
SEO Impact
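Blocking CCBot keeps your content out of future Common Crawl datasets, which reduces the chance of it being used to train models built on that corpus. Content already captured in earlier crawls stays in the published archives. Blocking CCBot does not stop the separate crawlers that companies run for proprietary models like GPT-4; those need their own robots.txt rules.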
User-Agent String
CCBot/2.0
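To spot CCBot requests in your own logs or request handlers, one option is to match the product token in the User-Agent header. The following is a minimal sketch, not an official detection method: the function name `is_ccbot` is hypothetical, and it assumes the header contains the `CCBot/` token shown above. User-Agent headers can be spoofed, so a match is only a hint.

```python
def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the CCBot product token.

    Assumption: CCBot identifies itself with a "CCBot/<version>" token,
    as in the "CCBot/2.0" string listed above. The match is case-insensitive.
    """
    return "ccbot/" in user_agent.lower()


if __name__ == "__main__":
    print(is_ccbot("CCBot/2.0"))
    print(is_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```

Because this is a plain substring check, it treats any client that sends the token as CCBot, whether or not the request really comes from Common Crawl.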
robots.txt Control
To block CCBot site-wide, add a "User-agent: CCBot" group with "Disallow: /" to your robots.txt.
Block
User-agent: CCBot
Disallow: /
Allow (default)
User-agent: CCBot
Allow: /
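The two rule sets above can be checked locally before you deploy them. This is a rough sketch using Python's standard `urllib.robotparser`; the helper name `ccbot_allowed` and the `example.com` URL are placeholders for illustration. The standard-library parser may differ in edge cases from the parser CCBot itself uses.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this page, one directive per line.
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    page = "https://example.com/page"  # placeholder URL
    print(ccbot_allowed(BLOCK_RULES, page))  # False
    print(ccbot_allowed(ALLOW_RULES, page))  # True
```

Passing the rules as text keeps the check offline; to test a live site you could instead call `set_url()` and `read()` on the parser.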