AI Crawlers — Complete Reference

Last updated: June 1, 2026

What this guide is and is not. This is the per-crawler reference for the AI crawler ecosystem as documented by each provider in 2026. For each crawler we separate documented facts (cited to the provider's own page) from recommendations (our practice). If you want a shorter primer on why AI crawler accessibility matters, see AI Crawler Optimization or AI Crawler Accessibility. If you want the robots.txt protocol in depth, see the robots.txt — Complete Guide.

1. Why a crawler reference exists

The AI crawler ecosystem changed faster than its documentation between 2023 and 2026. New user-agents arrived. Existing ones were renamed or split. Opt-out mechanisms shifted from undocumented to documented to standardized. Site operators trying to make informed access decisions found themselves stitching together blog posts, support articles, and community-circulated lists.

This guide is a current-state per-crawler reference. Every per-crawler subsection follows the same format: Documented facts cited to the provider, Recommended practice from us, and, where applicable, a Note on what is not documented. The separation is deliberate: community-circulated AI crawler lists conflate the three, and that is where most mistakes start.

At-a-glance crawler matrix

This table is a quick-reference snapshot. Per-crawler details follow in §2–§10. Behavior in this matrix is from each provider's own documentation; treat anything not in the provider's docs as inference, not fact.

| User-agent | Provider | Documented purpose | Opt-out |
|---|---|---|---|
| GPTBot | OpenAI | Crawl public web for OpenAI model training | robots.txt block |
| OAI-SearchBot | OpenAI | Surface content in ChatGPT Search results | robots.txt block (distinct from GPTBot) |
| ChatGPT-User | OpenAI | User-initiated retrieval within ChatGPT | robots.txt block (with caveats, see §4) |
| ClaudeBot | Anthropic | Content fetching for Claude products | robots.txt block |
| anthropic-ai | Anthropic | Historically training-data crawling | robots.txt block |
| PerplexityBot | Perplexity | Fetch content cited in Perplexity answers | robots.txt block |
| Google-Extended | Google | Control Google AI generative training use | robots.txt token (does not affect Search) |
| meta-externalagent | Meta | Crawl for Meta AI products | robots.txt block |
| FacebookBot | Meta | Crawl content for various Meta surfaces | robots.txt block |
| Applebot | Apple | Search and Siri suggestions | robots.txt block |
| Applebot-Extended | Apple | Control Apple generative AI training use | robots.txt token (does not affect Search/Siri) |
| Bytespider | ByteDance | Content crawling (provider documentation is minimal) | robots.txt block (de facto) |
| CCBot | Common Crawl | Open web archive used by many downstream consumers | robots.txt block |

2. GPTBot (OpenAI)

DOCUMENTED
OpenAI publishes GPTBot as the user-agent used to crawl public web content for OpenAI model training. OpenAI documents the user-agent string and the robots.txt opt-out mechanism in its bots documentation. OpenAI also publishes a JSON file listing GPTBot's IP ranges, allowing IP-level verification.
RECOMMENDED
Decide opt-in or opt-out deliberately. If your content is meant to be widely used (open-source documentation, public reference), allow GPTBot. If your content is your competitive product (paywalled work, proprietary research), consider blocking it. Whichever you pick, make the choice explicit in your robots.txt — do not rely on default behavior.
NOTE
GPTBot is for training. Blocking GPTBot does not prevent ChatGPT-User from fetching your URL when a user pastes it into a prompt. See §4.
User-agent: GPTBot
Disallow: /

3. OAI-SearchBot (OpenAI)

DOCUMENTED
OpenAI documents OAI-SearchBot as the user-agent used to surface content in ChatGPT Search. OpenAI documents it as distinct from GPTBot in its bots overview — they can be allowed or blocked independently.
RECOMMENDED
If you want your content to appear in ChatGPT Search results, allow OAI-SearchBot explicitly. Blocking GPTBot but allowing OAI-SearchBot is a coherent stance: opt out of training while remaining visible to ChatGPT Search.
NOTE
The two crawlers have separate purposes; treating them as one is a frequent mistake.
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
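
The split stance above can be sanity-checked offline before deployment. The following is a minimal sketch using Python's standard-library urllib.robotparser; the helper name access_by_agent is our own, and Python's parser evaluates rules in file order rather than by the longest-match rule in RFC 9309, so treat it as an approximation that is adequate for simple whole-site rules like these.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt stance from this section: training out, ChatGPT Search in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def access_by_agent(robots_txt, agents, url="/"):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    Hypothetical helper: parses the text locally, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = access_by_agent(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"])
# GPTBot is blocked, OAI-SearchBot is allowed, and ChatGPT-User is allowed
# because no group names it and there is no "User-agent: *" group.
print(result)
```

An agent with no matching group is treated as allowed, which is why an unnamed crawler is an implicit allow and why this guide recommends stating each choice explicitly.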

4. ChatGPT-User (OpenAI)

DOCUMENTED
OpenAI documents ChatGPT-User as the user-agent that fires when a ChatGPT user takes an action requiring a web fetch — for example, asking ChatGPT to summarize a URL the user pasted. OpenAI lists it in the bots overview alongside GPTBot and OAI-SearchBot.
RECOMMENDED
Understand that ChatGPT-User is user-driven, not crawler-driven. Blocking it primarily prevents users from successfully requesting your URL through ChatGPT — which may or may not be your goal. Most sites allow it.
NOTE
Because ChatGPT-User responds to user actions, blocking it does not stop a user from copy-pasting your content into a prompt directly.

5. ClaudeBot & anthropic-ai (Anthropic)

DOCUMENTED
Anthropic documents two user-agents. ClaudeBot is associated with content fetching for Claude products, including the web-search tool. anthropic-ai has historically been associated with training-data crawling. Both can be blocked separately in robots.txt. Anthropic's canonical reference is the support article linked in the Sources.
RECOMMENDED
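If you want to allow live retrieval but opt out of training, allow ClaudeBot and block anthropic-ai. If you want to opt out of both, block both. Because the roles Anthropic assigns to its user-agents have changed over time, confirm the current mapping in Anthropic's support article before relying on this split, and verify your robots.txt response after deploying the change.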
NOTE
Anthropic does not operate a public search engine, so "Claude visibility" is qualitatively different from search visibility — see the Claude Visibility — Complete Guide for the full discussion.
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Disallow: /

6. PerplexityBot (Perplexity)

DOCUMENTED
Perplexity documents PerplexityBot as its content-fetching crawler in its bots guide. The crawler retrieves content that may be cited in Perplexity answers.
RECOMMENDED
Perplexity's stated value proposition is citing the sources it uses. Allowing PerplexityBot enables citation. Blocking it removes the citation channel.
NOTE
Perplexity's documentation is the source of truth for the current user-agent string and any additional user-agents the company may add.
User-agent: PerplexityBot
Allow: /

7. Google-Extended (Google)

DOCUMENTED
Google introduced Google-Extended as a stand-alone robots.txt token in 2023 to give web publishers explicit control over whether their content is used to improve Google's generative AI products (Bard / Gemini and downstream). It is documented in Google's blog post on web publisher controls and in the Search Central crawlers overview. Disallowing Google-Extended does not affect Google Search ranking.
RECOMMENDED
Site operators who want their content in Google Search but not in Google's AI training set should add Google-Extended as a separate User-agent block with Disallow: /. Operators who want both should allow both.
NOTE
Google-Extended is a token in robots.txt, not a crawler. It does not have its own fetch traffic; it is a signal Google reads when deciding training inclusion.
User-agent: Google-Extended
Disallow: /

8. Meta — meta-externalagent and FacebookBot

DOCUMENTED
Meta documents meta-externalagent for its AI crawler activity and FacebookBot for content used across various Meta surfaces (including link previews and feed). Both are documented in Meta's developer documentation for sharing.
RECOMMENDED
If you want link previews on Facebook to continue working, allow FacebookBot. If you want to opt out of Meta's AI training use, block meta-externalagent. These are independent decisions.
NOTE
Meta's user-agent landscape has been less stable than OpenAI's or Anthropic's. Re-check the developer documentation when reviewing your robots.txt — the canonical name may shift.

9. Applebot & Applebot-Extended (Apple)

DOCUMENTED
Apple documents two robots.txt user-agent tokens in its Applebot support article. Applebot powers Apple's classic search and Siri suggestions. Applebot-Extended, introduced more recently, controls whether your content is used to improve Apple's generative AI products.
RECOMMENDED
If you want your content surfaced in Siri suggestions or Apple's search products, allow Applebot. If you want to opt out of generative AI training, block Applebot-Extended separately. The split mirrors Google's Googlebot / Google-Extended pattern.
NOTE
Blocking Applebot-Extended does not affect Applebot's classic search and Siri behavior. Treat them independently.
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /

10. Other documented AI crawlers — Bytespider and CCBot

Bytespider (ByteDance)

DOCUMENTED
ByteDance operates Bytespider as a content crawler. Provider documentation has historically been minimal compared to OpenAI or Anthropic; the canonical reference is ByteDance's bytespider.com domain.
RECOMMENDED
If you do not have a business reason to be visible to ByteDance products, blocking Bytespider is a reasonable default. The provider's documentation has not historically committed to any ranking or visibility benefit in exchange for allowing it.
NOTE
Because the documentation surface is thin, behavior should be treated as inferred until ByteDance publishes more detail.

CCBot (Common Crawl)

DOCUMENTED
Common Crawl operates CCBot to maintain an open web archive that is widely used by downstream AI training pipelines, academic research, and search tooling. Common Crawl publishes documentation for CCBot including the robots.txt opt-out.
RECOMMENDED
Blocking CCBot is the broadest single lever you have against AI training inclusion, because many downstream consumers train on Common Crawl. Conversely, allowing CCBot keeps your content available to a wide range of legitimate research and open-source uses.
NOTE
CCBot's reach is downstream: blocking it does not retroactively remove content already archived.

11. Identification techniques

Any User-Agent string can be spoofed. A robust identification process layers multiple signals.

User-Agent string match

The first line of identification. Each provider publishes the canonical UA string in its bots documentation. Match against the documented value (substring match works for most cases). This is the cheapest signal and the easiest to spoof; never use it alone.
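
As a rough illustration of the substring match, here is a small Python sketch. The token list mirrors the crawlers covered in this guide, the function name claimed_crawler is hypothetical, and the User-Agent strings in the usage lines are made-up examples, not the providers' canonical strings. Google-Extended and Applebot-Extended are left out because they are robots.txt tokens and do not appear in request headers.

```python
# Tokens taken from this guide; check each provider's documentation for the
# current canonical strings before relying on this list.
KNOWN_AI_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "meta-externalagent", "FacebookBot", "Applebot",
    "Bytespider", "CCBot",
)


def claimed_crawler(user_agent):
    """Return the first known token found in the User-Agent header, else None.

    This only reports what the client CLAIMS to be; it proves nothing alone.
    """
    ua = user_agent.lower()
    for token in KNOWN_AI_TOKENS:
        if token.lower() in ua:
            return token
    return None


# Illustrative header values (invented for this example).
print(claimed_crawler("Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"))
print(claimed_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

The return value is deliberately named "claimed": it is an input to the IP-range and reverse-DNS checks, not a verdict.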

IP-range cross-check

Where a provider publishes IP ranges, confirm the source IP falls inside the published range. OpenAI publishes a JSON file for GPTBot; whether Anthropic and other providers publish ranges varies, so check each provider's documentation. Published ranges change, so refresh your local copy on a fixed, documented cadence.
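
A minimal sketch of the range check, using Python's standard-library ipaddress module. It assumes you have already extracted the provider's ranges into a plain list of CIDR strings; the layout of any provider's published file is not modeled here, and the ranges in the usage lines are reserved documentation ranges, not real crawler ranges.

```python
import ipaddress


def ip_in_ranges(ip, cidrs):
    """Return True if `ip` falls inside any CIDR block in `cidrs`.

    `cidrs` is assumed to be a locally cached list of strings such as
    "192.0.2.0/24", refreshed from the provider's published ranges.
    """
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like (IPv4 vs IPv6).
        if addr.version == network.version and addr in network:
            return True
    return False


# Placeholder ranges from the documentation-reserved blocks.
cached_ranges = ["192.0.2.0/24", "2001:db8::/32"]
print(ip_in_ranges("192.0.2.10", cached_ranges))
print(ip_in_ranges("198.51.100.7", cached_ranges))
```

A malformed address raises ValueError from ipaddress, which a caller would normally treat as "not verified".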

Reverse DNS

For providers whose crawlers come from a domain they control, perform a reverse DNS lookup on the source IP and confirm the resolved hostname belongs to the expected provider domain. Then forward-confirm: resolve the hostname back to an IP and confirm it matches the original. This pattern is well-established in classic search crawler verification (Google publishes the technique in its Search Central docs); not all AI providers support it yet.
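
The reverse-then-forward pattern can be sketched in Python as follows. The function verify_reverse_dns and its parameters are our own illustration; the allowed domain in the usage lines is a placeholder, and you would substitute whatever domain a provider actually documents. The resolver functions are injectable so the logic can be exercised without live DNS.

```python
import socket


def _system_reverse(ip):
    # PTR lookup: IP -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _system_forward(hostname):
    # Forward lookup: hostname -> set of IP address strings.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_reverse_dns(ip, allowed_domains,
                       reverse=_system_reverse, forward=_system_forward):
    """Return True only if `ip` reverse-resolves to a host under one of
    `allowed_domains` AND that host forward-resolves back to `ip`."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False
    domains = [d.strip(".").lower() for d in allowed_domains]
    # Match the domain itself or a true subdomain, never a mere substring.
    if not any(hostname == d or hostname.endswith("." + d) for d in domains):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


# Usage with fake resolvers (placeholder names, no network needed).
ok = verify_reverse_dns(
    "192.0.2.10", ["bot.example.com"],
    reverse=lambda ip: "crawl-1.bot.example.com.",
    forward=lambda host: {"192.0.2.10"},
)
print(ok)
```

The suffix comparison anchors on a leading dot so that a hostname like bot.example.com.evil.test does not pass, and the forward step is what defeats an attacker who controls the PTR record for their own IP. IPv6 addresses may need normalizing before the string comparison.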

Defense in depth

Combine the three signals when the cost of misidentification is meaningful. For most marketing sites, User-Agent match plus simple rate-limiting is sufficient; for an API or paywalled content surface, add IP-range and reverse-DNS verification.

12. Access control patterns

The three common patterns are allow-all, selective-block, and AI-only-block. Each is a legitimate stance.

Pattern A — allow all (the helperg.com default)

# Allow all crawlers, AI included
User-agent: *
Allow: /

Used when discovery and citation are the goals. The current helperg.com /robots.txt implements this pattern, with named entries for AI crawlers to make the intent explicit.

Pattern B — selective-block (training out, retrieval in)

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Used when you want to remain visible to live retrieval (OAI-SearchBot, ChatGPT-User, ClaudeBot for tool fetches, PerplexityBot) but opt out of training inclusion.
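
If you maintain the opt-out list in code or configuration rather than by hand, a pattern like B can be generated. This is a minimal Python sketch; the function name render_robots_txt and the TRAINING_OPT_OUT tuple are hypothetical, and the tuple simply restates the tokens used in Pattern B.

```python
# Tokens blocked in Pattern B of this guide.
TRAINING_OPT_OUT = (
    "GPTBot", "anthropic-ai", "Google-Extended", "Applebot-Extended", "CCBot",
)


def render_robots_txt(blocked_agents, default_allow=True):
    """Render one 'Disallow: /' group per blocked agent, then a '*' group."""
    lines = []
    for agent in blocked_agents:
        # Reject empty or whitespace-containing tokens rather than emit
        # a malformed file.
        if not agent or any(ch.isspace() for ch in agent):
            raise ValueError(f"invalid user-agent token: {agent!r}")
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /" if default_allow else "Disallow: /"]
    return "\n".join(lines) + "\n"


print(render_robots_txt(TRAINING_OPT_OUT))
```

Keeping the list in one place makes review easier when a provider renames or adds a user-agent: the change is a one-line diff rather than a hand edit of the deployed file.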

Pattern C — AI-only-block (broad opt-out)

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Used when AI inclusion is incompatible with your business model. Note that this does not prevent humans from copying your content into prompts.

For full robots.txt syntax, precedence rules, and edge cases (wildcards, end-anchors, comments), see the robots.txt — Complete Guide.

13. Monitoring

Without monitoring you do not know whether your access control choices are reaching production or being honored. The minimum useful monitoring for a small operator:

For most operators, a monthly log review is sufficient. For high-value content, or for sites where a shift in traffic from a single AI source would matter, consider a lightweight automated check.
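
A lightweight automated check can be as small as counting requests per claimed crawler in an access log. The sketch below assumes a combined-style log format in which the User-Agent is the last double-quoted field on each line; the function name count_ai_hits, the default token list, and the sample log lines are all illustrative, not taken from a real server.

```python
import re
from collections import Counter

DEFAULT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
                  "anthropic-ai", "PerplexityBot", "Bytespider", "CCBot")

# Assumption: the User-Agent is the final quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines, tokens=DEFAULT_TOKENS):
    """Count requests per crawler token, based on the claimed User-Agent."""
    counts = Counter()
    for line in log_lines:
        match = _LAST_QUOTED_FIELD.search(line)
        if not match:
            continue  # line does not fit the assumed format
        ua = match.group(1).lower()
        for token in tokens:
            if token.lower() in ua:
                counts[token] += 1
                break
    return counts


# Invented sample lines for demonstration.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jun/2026:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [01/Jun/2026:00:00:05 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.10 - - [01/Jun/2026:00:00:09 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:00:00:12 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
]
hits = count_ai_hits(SAMPLE_LOG)
print(dict(hits))
```

Comparing these counts month over month, and checking whether a crawler you have blocked still requests pages other than /robots.txt, gives a first signal of whether your rules are being honored. The counts reflect claimed identity only.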

14. Common mistakes

  1. Treating User-Agent string as identity. Any client can send any UA string. Confidence requires multiple signals; for high-stakes decisions, layer IP-range and reverse-DNS verification.
  2. Assuming a robots.txt block stops humans. A robots.txt block does not prevent a user from pasting your content into a Claude or ChatGPT prompt. It is a crawler-access control, not a content-protection mechanism.
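  3. Treating GPTBot and OAI-SearchBot as one opt-out. They are distinct user-agents with distinct opt-outs. Blocking GPTBot does not remove you from ChatGPT Search, and blocking both removes ChatGPT Search visibility you may have wanted to keep. Decide each separately.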
  4. Blocking Google-Extended thinking it affects Search ranking. It does not. Google-Extended controls AI-training inclusion only; classic Google Search is governed by Googlebot.
  5. Blanket "block all AI" without business reasoning. If your content is a marketing site whose goal is reach, blocking all AI crawlers removes a citation channel without benefit. The deliberate choice may be allow-all.
  6. Ignoring downstream consumers. Common Crawl's CCBot powers many downstream training pipelines you may not have considered. Block CCBot if AI-training inclusion is what you want to avoid.
  7. Skipping verification. An opt-out rule that is never confirmed in logs is a statement of intent, not a documented outcome.
  8. Trusting community-circulated AI crawler lists. Provider-published documentation is the source of truth. Lists that are not cited to provider pages should be treated as inference.

15. Checklist

This is the implementation checklist for the crawler-access decisions on this page. The items below are a baseline; scale up to defense-in-depth for higher-value content.

16. FAQ

Does blocking GPTBot keep my content out of ChatGPT?

Not entirely. GPTBot is documented as the crawler used to gather public web data for training future OpenAI models. ChatGPT-User is a different user-agent that fires when a ChatGPT user takes an action that retrieves a URL (for example, pastes a link into a prompt). Blocking GPTBot in robots.txt does not affect ChatGPT-User fetches. OpenAI documents the distinction in its bots reference.

Is robots.txt legally binding for AI crawlers?

Robots.txt is an advisory standard defined in RFC 9309. It is widely respected by well-behaved crawlers from major providers, but it is not a legal contract and does not bind crawlers that ignore the protocol. Treat robots.txt as a strong convention, not a security boundary.

What is the difference between Googlebot and Google-Extended?

Googlebot is Google's classic search crawler that powers Google Search. Google-Extended is a separate signal that controls whether your content is used to improve Google's generative AI products such as Bard/Gemini. Disallowing Google-Extended does not affect Google Search ranking; disallowing Googlebot does. Google publishes both in its crawlers overview.

Do AI crawlers publish their IP ranges?

Some do. OpenAI publishes a JSON file of GPTBot IP ranges; other providers vary. Where ranges are published, check the source IP against them. Where they are not, combine User-Agent string matching with a reverse DNS lookup against the provider's documented domain, if the provider documents one.

How can I confirm an AI crawler is actually who it claims to be?

Three checks: (1) match the User-Agent string against the provider's documented value, (2) cross-check the source IP against the provider's published ranges if available, and (3) perform a reverse DNS lookup to a domain controlled by the provider, then forward-confirm. Any one check can be spoofed; combining checks raises confidence.

Should I block all AI crawlers by default?

Depends on your goals. Blocking all AI crawlers reduces the chance your content is used for model training, but it also reduces the chance your content is cited or referenced by AI products that may send traffic. There is no universally correct answer — make the choice deliberately, based on what your content is for.

Does Crawl-delay work for AI crawlers?

Crawl-delay is a de-facto extension to robots.txt and is not part of RFC 9309. Some crawlers honor it; many do not. If you observe excessive load from a documented crawler, the more reliable path is to contact the provider through the support channel they document, or use server-level rate limiting.
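
To make "server-level rate limiting" concrete, here is a minimal token-bucket sketch in Python. It is an illustration of the general technique under our own assumptions (class name, parameters, one bucket per claimed crawler), not a description of any particular server's or provider's mechanism; in practice most operators would use the rate-limiting feature of their web server or CDN.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable for testing
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: one bucket per claimed crawler token (hypothetical limits).
buckets = {"GPTBot": TokenBucket(rate_per_sec=1, burst=5)}
decisions = [buckets["GPTBot"].allow() for _ in range(6)]
print(decisions)  # the sixth back-to-back request is expected to be refused
```

A refused request would typically receive an HTTP 429 response. Unlike Crawl-delay, this limit is enforced by your server and does not depend on the crawler choosing to cooperate.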

Where is the canonical source for any specific crawler's behavior?

Always the provider's own documentation. This guide cites each provider's canonical doc in the Sources section. If a behavior is not documented by the provider, treat it as inference rather than fact.

17. Sources

Documentation drift note: AI crawler documentation has changed several times per year. If you rely on a specific URL from this page, capture it with the date you reviewed it, and re-check periodically.