CCBot
AI / LLM by Common Crawl

CCBot is Common Crawl's web crawler. It builds open datasets that have been used to train many major AI models, including GPT-3, LLaMA, and Mistral, and its corpus is one of the most widely used training sources in the AI industry.

Respects robots.txt: Yes
Can be blocked: Yes
Crawl-Delay support: No
Type: AI / LLM

Purpose

Open web corpus used to train many AI models (GPT-3, LLaMA, Mistral)

SEO Impact

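Blocking CCBot keeps your content out of future Common Crawl datasets, which reduces the chance of it being used to train open-source AI models. It does not, by itself, keep content out of proprietary models such as GPT-4, whose vendors can collect data with their own crawlers (GPTBot, for example); those crawlers need their own robots.txt rules.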

User-Agent String

CCBot/2.0
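
If you want to spot CCBot in server logs or request headers, matching the product token is usually enough. The following is a minimal sketch, not an official detection method: the function name `ccbot_version` and the sample header values are assumptions made for illustration, and the only token taken from this page is `CCBot/2.0`. A User-Agent header can be spoofed, so treat a match as a hint rather than proof.

```python
import re
from typing import Optional

# Matches a "CCBot/<version>" product token anywhere in a User-Agent value.
# The word boundary keeps unrelated names such as "NotCCBot/1.0" from matching.
CCBOT_TOKEN = re.compile(r"\bCCBot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def ccbot_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the CCBot version found in the header, or None if absent."""
    match = CCBOT_TOKEN.search(user_agent or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical header values used only to exercise the function.
    samples = [
        "CCBot/2.0",
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "",
    ]
    for ua in samples:
        print(repr(ua), "->", ccbot_version(ua))
```

Returning the version instead of a plain boolean makes it easier to notice if the token ever changes from 2.0.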

robots.txt Control

To block CCBot, add a "User-agent: CCBot" group with "Disallow: /" to your robots.txt. To allow it, either omit the group or use "Allow: /".

Block
User-agent: CCBot
Disallow: /

Allow (default)
User-agent: CCBot
Allow: /
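
To confirm how the two rule sets above are interpreted, you can run them through Python's standard `urllib.robotparser` module. This is a sketch under stated assumptions: the helper `is_allowed` and the `example.com` URL are hypothetical, and the standard-library parser only approximates how any particular crawler reads robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown on this page.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""

ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""

# Hypothetical URL used only for the check.
URL = "https://example.com/articles/post.html"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("CCBot with block rules: ", is_allowed(BLOCK_RULES, "CCBot", URL))
    print("CCBot with allow rules: ", is_allowed(ALLOW_RULES, "CCBot", URL))
    # A group addressed to CCBot does not apply to other crawlers.
    print("GPTBot with block rules:", is_allowed(BLOCK_RULES, "GPTBot", URL))
```

The last check shows that a CCBot-only group leaves other user-agents unaffected, so each crawler you want to block needs its own group or a wildcard rule.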

Official Documentation

Verify CCBot

Other AI / LLM Bots

GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot