AI Crawlers — Complete Reference
Last updated: June 1, 2026
What this guide is and is not. This is the per-crawler reference for the AI crawler ecosystem as documented by each provider in 2026. For each crawler we separate documented facts (cited to the provider's own page) from recommendations (our practice). If you want a shorter primer on why AI crawler accessibility matters, see AI Crawler Optimization or AI Crawler Accessibility. If you want the robots.txt protocol in depth, see the robots.txt — Complete Guide.
1. Why a crawler reference exists
The AI crawler ecosystem changed faster than its documentation between 2023 and 2026. New user-agents arrived. Existing ones were renamed or split. Opt-out mechanisms shifted from undocumented to documented to standardized. Site operators trying to make informed access decisions found themselves stitching together blog posts, support articles, and community-circulated lists.
This guide is a current-state per-crawler reference. For each documented crawler, we list:
- The user-agent string the provider publishes.
- The documented purpose of the crawler.
- The opt-out mechanism the provider documents.
- The provider's canonical source for the above (linked in the Sources section).
- Where relevant, what is not documented and should not be assumed.
Every per-crawler subsection follows the same format: Documented facts cited to the provider, Recommended practice from us, and, where applicable, a Note on what is not documented. This separation is deliberate: community-circulated AI crawler lists conflate the three, and that is where most mistakes start.
At-a-glance crawler matrix
This table is a quick-reference snapshot. Per-crawler details follow in §2–§10. Behavior in this matrix is from each provider's own documentation; treat anything not in the provider's docs as inference, not fact.
| User-agent | Provider | Documented purpose | Opt-out |
|---|---|---|---|
| GPTBot | OpenAI | Crawl public web for OpenAI model training | robots.txt block |
| OAI-SearchBot | OpenAI | Surface content in ChatGPT Search results | robots.txt block (distinct from GPTBot) |
| ChatGPT-User | OpenAI | User-initiated retrieval within ChatGPT | robots.txt block (with caveats, see §4) |
| ClaudeBot | Anthropic | Content fetching for Claude products | robots.txt block |
| anthropic-ai | Anthropic | Historically training-data crawling | robots.txt block |
| PerplexityBot | Perplexity | Fetch content cited in Perplexity answers | robots.txt block |
| Google-Extended | Google | Control Google AI generative training use | robots.txt token (does not affect Search) |
| meta-externalagent | Meta | Crawl for Meta AI products | robots.txt block |
| FacebookBot | Meta | Crawl content for various Meta surfaces | robots.txt block |
| Applebot | Apple | Search and Siri suggestions | robots.txt block |
| Applebot-Extended | Apple | Control Apple generative AI training use | robots.txt token (does not affect Search/Siri) |
| Bytespider | ByteDance | Content crawling; provider documentation is minimal | robots.txt block (de facto) |
| CCBot | Common Crawl | Open web archive used by many downstream consumers | robots.txt block |
2. GPTBot (OpenAI)
- DOCUMENTED
- OpenAI publishes GPTBot as the user-agent used to crawl public web content for OpenAI model training. OpenAI documents the user-agent string and the robots.txt opt-out mechanism in its bots documentation. OpenAI also publishes a JSON file listing GPTBot's IP ranges, allowing IP-level verification.
- RECOMMENDED
- Decide opt-in or opt-out deliberately. If your content is meant to be widely used (open-source documentation, public reference), allow GPTBot. If your content is your competitive product (paywalled work, proprietary research), consider blocking it. Whichever you pick, make the choice explicit in your robots.txt — do not rely on default behavior.
- NOTE
- GPTBot is for training. Blocking GPTBot does not prevent ChatGPT-User from fetching your URL when a user pastes it into a prompt. See §4.
User-agent: GPTBot
Disallow: /
3. OAI-SearchBot (OpenAI)
- DOCUMENTED
- OpenAI documents OAI-SearchBot as the user-agent used to surface content in ChatGPT Search. OpenAI documents it as distinct from GPTBot in its bots overview; the two can be allowed or blocked independently.
- RECOMMENDED
- If you want your content to appear in ChatGPT Search results, allow OAI-SearchBot explicitly. Blocking GPTBot but allowing OAI-SearchBot is a coherent stance: opt out of training while remaining visible to ChatGPT Search.
- NOTE
- The two crawlers have separate purposes; treating them as one is a frequent mistake.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
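One way to confirm that the two groups above resolve independently is to feed the rules to a parser and ask about each token. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed` and the `example.com` URL are illustrative assumptions, and real crawlers may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, held in a string so no network fetch is needed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent_token: str, url: str) -> bool:
    """Return True if the rules permit user_agent_token to fetch url.

    Pass the product token (for example "GPTBot"), not a full
    User-Agent header: the standard-library parser only inspects
    the text before the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent_token, url)


if __name__ == "__main__":
    url = "https://example.com/guide"  # hypothetical page
    print(is_allowed(ROBOTS_TXT, "GPTBot", url))         # False
    print(is_allowed(ROBOTS_TXT, "OAI-SearchBot", url))  # True
```

Running the same check against the robots.txt you actually serve is a cheap guard against a rule that was written for one token silently covering another.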
4. ChatGPT-User (OpenAI)
- DOCUMENTED
- OpenAI documents ChatGPT-User as the user-agent that fires when a ChatGPT user takes an action requiring a web fetch, for example asking ChatGPT to summarize a URL the user pasted. OpenAI lists it in the bots overview alongside GPTBot and OAI-SearchBot.
- RECOMMENDED
- Understand that ChatGPT-User is user-driven, not crawler-driven. Blocking it primarily prevents users from successfully requesting your URL through ChatGPT — which may or may not be your goal. Most sites allow it.
- NOTE
- Because ChatGPT-User responds to user actions, blocking it does not stop a user from copy-pasting your content into a prompt directly.
5. ClaudeBot & anthropic-ai (Anthropic)
- DOCUMENTED
- Anthropic documents two user-agents. ClaudeBot is associated with content fetching for Claude products, including the web-search tool. The anthropic-ai user-agent has historically been associated with training-data crawling. Both can be blocked separately in robots.txt. Anthropic's canonical reference is the support article linked in the Sources.
- RECOMMENDED
- If you want to allow live retrieval but opt out of training, allow ClaudeBot and block anthropic-ai. If you want to opt out of both, block both. Verify your robots.txt response after deploying the change.
- NOTE
- Anthropic does not operate a public search engine, so "Claude visibility" is qualitatively different from search visibility — see the Claude Visibility — Complete Guide for the full discussion.
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Disallow: /
6. PerplexityBot (Perplexity)
- DOCUMENTED
- Perplexity documents PerplexityBot as its content-fetching crawler in its bots guide. The crawler retrieves content that may be cited in Perplexity answers.
- RECOMMENDED
- Perplexity's stated value proposition is citing the sources it uses. Allowing PerplexityBot enables citation. Blocking it removes the citation channel.
- NOTE
- Perplexity's documentation is the source of truth for the current user-agent string and any additional user-agents the company may add.
User-agent: PerplexityBot
Allow: /
7. Google-Extended (Google)
- DOCUMENTED
- Google introduced Google-Extended as a stand-alone robots.txt token in 2023 to give web publishers explicit control over whether their content is used to improve Google's generative AI products (Bard / Gemini and downstream). It is documented in Google's blog post on web publisher controls and in the Search Central crawlers overview. Disallowing Google-Extended does not affect Google Search ranking.
- RECOMMENDED
- Site operators who want their content in Google Search but not in Google's AI training set should add Google-Extended as a separate User-agent block with Disallow: /. Operators who want both should allow both.
- NOTE
- Google-Extended is a token in robots.txt, not a crawler. It does not have its own fetch traffic; it is a signal Google reads when deciding training inclusion.
User-agent: Google-Extended
Disallow: /
8. Meta — meta-externalagent and FacebookBot
- DOCUMENTED
- Meta documents meta-externalagent for its AI crawler activity and FacebookBot for content used across various Meta surfaces (including link previews and feed). Both are documented in Meta's developer documentation for sharing.
- RECOMMENDED
- If you want link previews on Facebook to continue working, allow FacebookBot. If you want to opt out of Meta's AI training use, block meta-externalagent. These are independent decisions.
- NOTE
- Meta's user-agent landscape has been less stable than OpenAI's or Anthropic's. Re-check the developer documentation when reviewing your robots.txt — the canonical name may shift.
9. Applebot & Applebot-Extended (Apple)
- DOCUMENTED
- Apple documents two user-agents in its support center. Applebot powers Apple's classic search and Siri suggestions. Applebot-Extended, introduced more recently, controls whether your content is used to improve Apple's generative AI products. Apple documents both in its Applebot support article.
- RECOMMENDED
- If you want your content surfaced in Siri suggestions or Apple's search products, allow Applebot. If you want to opt out of generative AI training, block Applebot-Extended separately. The split mirrors Google's Googlebot / Google-Extended pattern.
- NOTE
- Blocking Applebot-Extended does not affect Applebot's classic search and Siri behavior. Treat them independently.
User-agent: Applebot
Allow: /
User-agent: Applebot-Extended
Disallow: /
10. Other documented AI crawlers — Bytespider and CCBot
Bytespider (ByteDance)
- DOCUMENTED
- ByteDance operates Bytespider as a content crawler. Provider documentation has historically been minimal compared to OpenAI or Anthropic; the canonical reference is ByteDance's bytespider.com domain.
- RECOMMENDED
- If you do not have a business reason to be visible to ByteDance products, blocking Bytespider is a reasonable default. The provider's documentation has not historically committed to ranking or visibility benefit for allowing it.
- NOTE
- Because the documentation surface is thin, behavior should be treated as inferred until ByteDance publishes more detail.
CCBot (Common Crawl)
- DOCUMENTED
- Common Crawl operates CCBot to maintain an open web archive that is widely used by downstream AI training pipelines, academic research, and search tooling. Common Crawl publishes documentation for CCBot including the robots.txt opt-out.
- RECOMMENDED
- Blocking CCBot is the broadest single lever you have against AI training inclusion, because many downstream consumers train on Common Crawl. Conversely, allowing CCBot keeps your content available to a wide range of legitimate research and open-source uses.
- NOTE
- CCBot's reach is downstream: blocking it does not retroactively remove content already archived.
11. Identification techniques
Any User-Agent string can be spoofed. A robust identification process layers multiple signals.
User-Agent string match
The first line of identification. Each provider publishes the canonical UA string in its bots documentation. Match against the documented value (substring match works for most cases). This is the cheapest signal and the easiest to spoof; never use it alone.
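A substring match of this kind might look like the sketch below. The token list is copied from this guide's matrix (Google-Extended and Applebot-Extended are omitted because they are robots.txt control tokens, not fetching user-agents), the function name `claimed_crawler` is an assumption, and the sample header in the usage lines is invented for illustration rather than taken from any provider's documentation.

```python
from typing import Optional

# Tokens taken from the matrix in this guide; verify each against the
# provider's own page before relying on it.
DOCUMENTED_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "meta-externalagent", "FacebookBot", "Applebot",
    "Bytespider", "CCBot",
)


def claimed_crawler(user_agent_header: str) -> Optional[str]:
    """Return the documented token the header claims to be, or None.

    This only reports what the client *claims*; the header is trivially
    spoofable, so treat the result as a hint, not an identity.
    """
    header = user_agent_header.lower()
    for token in DOCUMENTED_TOKENS:
        if token.lower() in header:
            return token
    return None


if __name__ == "__main__":
    # Hypothetical header, shaped like a typical crawler User-Agent.
    print(claimed_crawler("Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"))
    print(claimed_crawler("Mozilla/5.0 (X11; Linux x86_64)"))
```

Case-insensitive matching is a deliberate choice here: it tolerates capitalization drift in the header at the cost of slightly looser matching.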
IP-range cross-check
Where a provider publishes IP ranges (OpenAI publishes a JSON file for GPTBot; Anthropic and other providers vary), confirm the source IP falls inside the published range. Provider IP ranges change; refresh your local copy on a documented cadence.
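The range check itself is simple once the ranges are available locally. The following is a minimal sketch using Python's standard `ipaddress` module; it assumes you have already extracted CIDR strings from the provider's published file (the file's layout is not assumed here), and the ranges shown are reserved documentation blocks used as placeholders, not real crawler ranges.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDRs from reserved documentation address space.
# Replace with the ranges from your local copy of the provider's list.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(source_ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if source_ip falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like (IPv4 vs IPv6).
        if address.version == network.version and address in network:
            return True
    return False


if __name__ == "__main__":
    print(ip_in_ranges("192.0.2.17", EXAMPLE_RANGES))    # True
    print(ip_in_ranges("198.51.100.5", EXAMPLE_RANGES))  # False
```

Storing the ranges alongside the date they were fetched makes it easier to tell a stale list from a genuine mismatch.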
Reverse DNS
For providers whose crawlers come from a domain they control, perform a reverse DNS lookup on the source IP and confirm the resolved hostname belongs to the expected provider domain. Then forward-confirm: resolve the hostname back to an IP and confirm it matches the original. This pattern is well-established in classic search crawler verification (Google publishes the technique in its Search Central docs); not all AI providers support it yet.
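The reverse-then-forward pattern can be sketched with Python's standard `socket` module as below. The function name, the injectable `reverse` and `forward` resolvers, and the `crawl.example.com` suffix are assumptions made for illustration; substitute the hostname suffix the provider actually documents, if it documents one.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # A/AAAA lookup: hostname -> list of IP address strings.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if the reverse and forward lookups agree.

    Step 1: the PTR hostname must sit under an allowed domain suffix.
    Step 2: resolving that hostname must give back the original IP.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    suffixes = [s.lower() for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

Matching on `"." + suffix` rather than a bare `endswith(suffix)` avoids accepting a look-alike host such as `evilcrawl.example.com`. The resolvers are parameters so the logic can be exercised without live DNS; in production the defaults perform real lookups, which are slow enough that results are usually cached.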
Defense in depth
Combine the three signals when the cost of misidentification is meaningful. For most marketing sites, User-Agent match plus simple rate-limiting is sufficient; for an API or paywalled content surface, add IP-range + DNS verification.
12. Access control patterns
The three common patterns are allow-all, selective-block, and AI-only-block. Each is a legitimate stance.
Pattern A — allow all (the helperg.com default)
# Allow all crawlers, AI included
User-agent: *
Allow: /
Used when discovery and citation are the goals. The current helperg.com /robots.txt implements this pattern, with named entries for AI crawlers to make the intent explicit.
Pattern B — selective-block (training out, retrieval in)
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
Used when you want to remain visible to live retrieval (OAI-SearchBot, ChatGPT-User, ClaudeBot for tool fetches, PerplexityBot) but opt out of training inclusion.
Pattern C — AI-only-block (broad opt-out)
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
Used when AI inclusion is incompatible with your business model. Note that this does not prevent humans from copying your content into prompts.
For full robots.txt syntax, precedence rules, and edge cases (wildcards, end-anchors, comments), see the robots.txt — Complete Guide.
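Because the three patterns differ only in which tokens are disallowed, one option is to generate the file from a single list kept in version control. The sketch below is one possible shape for that; the function name `render_robots_txt`, its parameters, and the sitemap URL in the usage lines are illustrative assumptions, not part of any provider's tooling.

```python
from typing import Iterable, Optional

# Tokens blocked under Pattern B in this guide (training out, retrieval in).
PATTERN_B = ["GPTBot", "anthropic-ai", "Google-Extended", "Applebot-Extended", "CCBot"]


def render_robots_txt(blocked_tokens: Iterable[str], sitemap_url: Optional[str] = None) -> str:
    """Render one Disallow group per blocked token, then an allow-all default."""
    lines = []
    for token in blocked_tokens:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # An empty list yields Pattern A; PATTERN_B yields the selective block.
    print(render_robots_txt(PATTERN_B, sitemap_url="https://example.com/sitemap.xml"))
```

Keeping the policy as data means a change to the access posture shows up as a one-line diff, which makes later audits of who was blocked and when much easier.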
13. Monitoring
Without monitoring you do not know whether your access control choices are reaching production or being honored. The minimum useful monitoring for a small operator:
- Server log review. Group requests by User-Agent. Look for the documented strings above. Confirm the counts make intuitive sense — a sudden spike or zero where you expect traffic is the signal.
- Robots.txt fetch confirmation. Verify documented crawlers are fetching /robots.txt before fetching content, and that the response code is 200. A misconfigured server can defeat your access control entirely: under RFC 9309, a crawler may treat a 4xx response as permission to fetch anything and must treat a 5xx response as a complete disallow, so in either case the rules you wrote are not the rules applied.
- Disallow path spot-check. If you have specifically disallowed a path, confirm in your logs that the documented crawlers are not fetching it.
- Provider IP range refresh. Where IP ranges are published, refresh them quarterly or after a provider announces changes.
For most operators, a monthly log review is sufficient. For high-value content, or for sites where a shift in traffic from a single AI source would matter, consider a lightweight automated check.
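A lightweight check of this kind could start from something like the sketch below, which groups requests by claimed crawler token and flags non-200 responses for /robots.txt. It assumes access logs in the common "combined" format; the regular expression, the function name, and the token list are illustrative and would need adjusting to your server's actual log layout.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches: "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "meta-externalagent", "FacebookBot", "Applebot",
    "Bytespider", "CCBot",
)


def summarize_crawler_hits(log_lines: Iterable[str]) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count requests per claimed crawler and collect robots.txt failures."""
    hits: Counter = Counter()
    robots_errors: List[Tuple[str, str]] = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # line is not in the expected format
        ua = match.group("ua").lower()
        token = next((t for t in TOKENS if t.lower() in ua), None)
        if token is None:
            continue  # not a crawler this guide tracks
        hits[token] += 1
        if match.group("path") == "/robots.txt" and match.group("status") != "200":
            robots_errors.append((token, match.group("status")))
    return hits, robots_errors
```

The counts are keyed on the claimed User-Agent only, so they inherit its spoofability; they are useful for spotting spikes and gaps, not for proving who sent the traffic.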
14. Common mistakes
- Treating User-Agent string as identity. Any client can send any UA string. Confidence requires multiple signals; for high-stakes decisions, layer IP-range and reverse-DNS verification.
- Assuming a robots.txt block stops humans. A robots.txt block does not prevent a user from pasting your content into a Claude or ChatGPT prompt. It is a crawler-access control, not a content-protection mechanism.
- Blocking GPTBot but expecting OAI-SearchBot visibility. They are distinct user-agents with distinct opt-outs. Decide each separately.
- Blocking Google-Extended thinking it affects Search ranking. It does not. Google-Extended controls AI-training inclusion only; classic Google Search is governed by Googlebot.
- Blanket "block all AI" without business reasoning. If your site exists for marketing reach, blocking all AI crawlers removes a citation channel without benefit. The deliberate choice may be allow-all.
- Ignoring downstream consumers. Common Crawl's CCBot powers many downstream training pipelines you may not have considered. Block CCBot if AI-training inclusion is what you want to avoid.
- Skipping verification. An opt-out rule that is never confirmed in logs is a statement of intent, not a documented outcome.
- Trusting community-circulated AI crawler lists. Provider-published documentation is the source of truth. Lists that are not cited to provider pages should be treated as inference.
15. Checklist
The implementation checklist for the crawler-access decisions on this page. Items below are baseline; scale up to defense-in-depth for higher-value content.
- Robots.txt exists at the site root and returns HTTP 200.
- Robots.txt names GPTBot with a deliberate Allow or Disallow rule.
- Robots.txt names OAI-SearchBot separately from GPTBot.
- Robots.txt names ChatGPT-User separately, with awareness that this is user-initiated.
- Robots.txt names ClaudeBot with a deliberate rule.
- Robots.txt names anthropic-ai separately from ClaudeBot if you want to split training vs. retrieval.
- Robots.txt names PerplexityBot with a deliberate rule.
- Robots.txt names Google-Extended as the training-control token.
- Robots.txt names Applebot and Applebot-Extended separately.
- Robots.txt names meta-externalagent if you have opinions on Meta AI training.
- Robots.txt names FacebookBot if you want Facebook link previews.
- Robots.txt names Bytespider with a deliberate rule.
- Robots.txt names CCBot with awareness of its downstream reach.
- Server logs capture User-Agent for each request and are reviewable.
- Robots.txt requests from documented crawlers are returning 200, not 4xx or 5xx.
- A monthly log review confirms the documented crawlers behave as expected.
- Documented IP ranges (where published) are stored locally and refreshed quarterly.
- Reverse DNS verification is implemented for any high-value path where misidentification matters.
- The Sitemap is referenced in robots.txt via the Sitemap: directive.
- Sitemap entries are canonical URLs with accurate <lastmod>.
- Each public page declares a single canonical URL.
- JSON-LD on each page is valid and does not include fabricated review or rating schema.
- Access-control decisions are documented in a short internal note so changes can be audited.
- The robots.txt change history is tracked in version control.
- An owner is identified for re-reviewing the access-control posture annually.
- The provider documentation for each crawler is bookmarked for re-verification.
- A dated capture is kept of each provider's documented user-agent strings.
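The "Robots.txt names ..." items in the checklist above lend themselves to a small automated audit. The sketch below reports which expected tokens have no explicit User-agent line in a robots.txt body; the function name `unnamed_tokens` is an assumption, and the check is purely textual, so it confirms a token is named without judging whether the rule under it is the one you intended.

```python
from typing import Iterable, List


def unnamed_tokens(robots_txt: str, expected: Iterable[str]) -> List[str]:
    """Return the expected tokens that lack an explicit User-agent line."""
    named = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return [token for token in expected if token.lower() not in named]


if __name__ == "__main__":
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    print(unnamed_tokens(sample, ["GPTBot", "OAI-SearchBot", "CCBot"]))
    # ['OAI-SearchBot', 'CCBot']
```

A wildcard group still governs any token this reports, so an entry in the output means "covered only by default behavior", which is exactly what the checklist asks you to make explicit.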
16. FAQ
Does blocking GPTBot keep my content out of ChatGPT?
Not entirely. GPTBot is documented as the crawler used to gather public web data for training future OpenAI models. ChatGPT-User is a different user-agent that fires when a ChatGPT user takes an action that retrieves a URL (for example, pastes a link into a prompt). Blocking GPTBot in robots.txt does not affect ChatGPT-User fetches. OpenAI documents the distinction in its bots reference.
Is robots.txt legally binding for AI crawlers?
Robots.txt is an advisory standard defined in RFC 9309. It is widely respected by well-behaved crawlers from major providers, but it is not a legal contract and does not bind crawlers that ignore the protocol. Treat robots.txt as a strong convention, not a security boundary.
What is the difference between Googlebot and Google-Extended?
Googlebot is Google's classic search crawler that powers Google Search. Google-Extended is a separate signal that controls whether your content is used to improve Google's generative AI products such as Bard/Gemini. Disallowing Google-Extended does not affect Google Search ranking; disallowing Googlebot does. Google publishes both in its crawlers overview.
Do AI crawlers publish their IP ranges?
Some do. OpenAI publishes a JSON file of GPTBot IP ranges. Other providers vary. The most reliable verification method is to combine User-Agent string matching with a reverse DNS lookup on the provider's documented domain, where the provider documents one.
How can I confirm an AI crawler is actually who it claims to be?
Three checks: (1) match the User-Agent string against the provider's documented value, (2) cross-check the source IP against the provider's published ranges if available, and (3) perform a reverse DNS lookup to a domain controlled by the provider, then forward-confirm. Any one check can be spoofed; combining checks raises confidence.
Should I block all AI crawlers by default?
Depends on your goals. Blocking all AI crawlers reduces the chance your content is used for model training, but it also reduces the chance your content is cited or referenced by AI products that may send traffic. There is no universally correct answer — make the choice deliberately, based on what your content is for.
Does Crawl-delay work for AI crawlers?
Crawl-delay is a de-facto extension to robots.txt and is not part of RFC 9309. Some crawlers honor it; many do not. If you observe excessive load from a documented crawler, the more reliable path is to contact the provider through the support channel they document, or use server-level rate limiting.
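To make the rate-limiting option concrete, here is a minimal in-process sketch of a sliding-window limiter keyed by claimed crawler token. The class name, parameters, and injectable clock are assumptions for illustration; in practice this logic usually lives in the web server, reverse proxy, or CDN rather than in application code.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class CrawlerRateLimiter:
    """Allow at most max_requests per crawler token in any window_seconds span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, crawler_token: str) -> bool:
        now = self._clock()
        hits = self._hits[crawler_token]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False  # caller would typically respond with HTTP 429
        hits.append(now)
        return True
```

Unlike Crawl-delay, this does not depend on the crawler's cooperation: requests over the limit are refused regardless of what the client chooses to honor.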
Where is the canonical source for any specific crawler's behavior?
Always the provider's own documentation. This guide cites each provider's canonical doc in the Sources section. If a behavior is not documented by the provider, treat it as inference rather than fact.
17. Sources
- OpenAI — Bots overview (GPTBot, OAI-SearchBot, ChatGPT-User) — captured 2026-06
- OpenAI — GPTBot reference — captured 2026-06
- Anthropic — Does Anthropic crawl data from the web? — captured 2026-06
- Perplexity — Bots and crawlers — captured 2026-06
- Google — Web publisher controls (Google-Extended announcement) — captured 2026-06
- Google Search Central — Overview of Google crawlers — captured 2026-06
- Meta — Facebook crawler (FacebookBot) — captured 2026-06
- Meta — Sharing bot reference — captured 2026-06
- Apple — About Applebot — captured 2026-06
- Common Crawl — CCBot — captured 2026-06
- ByteDance — Bytespider — captured 2026-06
- RFC 9309 — Robots Exclusion Protocol — captured 2026-06
Documentation drift note: AI crawler documentation has changed several times per year. If you rely on a specific URL from this page, capture it with the date you reviewed it, and re-check periodically.