robots.txt — Complete Guide

Last updated: June 1, 2026

What this guide is and is not. This is the protocol-anchored reference for robots.txt: RFC 9309, the four directives, AI crawler control, common mistakes, validation, and helperg.com's live /robots.txt as a worked example. If you want a task-focused workflow for crawlability problems, see How to Improve Crawlability. For per-crawler details (each AI crawler's user-agent, opt-out, documentation), see the AI Crawlers — Complete Reference.

1. Robots fundamentals

The Robots Exclusion Protocol is specified in RFC 9309. The protocol is advisory: it signals a site owner's preferences to well-behaved automated clients. It does not bind crawlers that choose to ignore it, and it is not a security boundary.

A robots.txt file lives at the host root — https://example.com/robots.txt. Crawlers fetch it before crawling content. Subdirectory files are not picked up. One robots.txt per host (origin), no exceptions.
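As a minimal sketch of the "one file per host, at the root" rule, the following Python helper derives the robots.txt URL that governs any page URL. The function name robots_url_for is hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus optional port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://example.com/docs/page.html?ref=nav"))
print(robots_url_for("https://blog.example.com:8443/2026/06/post"))
```

Note that blog.example.com and example.com are different hosts here, so each would need its own file.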

The file is plain text encoded as UTF-8. The line-based grammar is intentionally simple: each group starts with one or more User-agent lines, followed by that group's Allow and Disallow rules. Sitemap lines stand outside the groups.

2. The four directives

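This guide covers four directives. RFC 9309 specifies three of them (User-agent, Allow, and Disallow); Sitemap comes from the sitemaps.org protocol, and RFC 9309 mentions it only as an example of other records a crawler may interpret. Crawl-delay is a de-facto extension that is widely used but not part of the standard.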

User-agent

Names the crawler the following rules apply to. A literal wildcard * matches all crawlers. Multiple User-agent lines can stack to apply the same rules to multiple crawlers.

User-agent: Googlebot
Disallow: /admin/

User-agent: *
Allow: /

Disallow

Tells the named crawler not to fetch URLs matching the path. The value is matched as a prefix against the URL's path (and query string, if any), starting from /.

Disallow: /admin/
Disallow: /draft-

Allow

RFC 9309 defines Allow as the counterpart to Disallow: it permits crawling of matching paths. It is most useful for carving exceptions out of a broader Disallow.

Disallow: /docs/
Allow: /docs/public/

Sitemap

Advertises the location of one or more sitemaps. Uses an absolute URL. Can appear anywhere in the file; not bound to a User-agent group.

Sitemap: https://example.com/sitemap.xml
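To check that a file's Sitemap lines are picked up, one option is Python's standard-library urllib.robotparser, whose site_maps() method (Python 3.8 and later) returns the listed URLs. This is a small sketch; the wrapper name sitemaps_in is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def sitemaps_in(robots_txt: str) -> list[str]:
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""
print(sitemaps_in(ROBOTS_TXT))
```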

Crawl-delay (de-facto extension, NOT in RFC 9309)

A numeric delay in seconds between fetches. Google has publicly stated it ignores Crawl-delay; some other crawlers honor it. Use server-side rate limiting if Crawl-delay does not have the effect you need.

3. Precedence and matching rules

When a URL matches more than one rule, RFC 9309 specifies a deterministic precedence:

Most specific rule wins. The rule with the longer matching path prefix takes effect.
If two rules tie on length, Allow wins over Disallow. The less restrictive rule applies when length is identical.
If no rule matches, the URL is allowed. Default behavior is permissive.
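The three precedence rules above can be sketched in a few lines of Python. This is an illustrative model, not a full parser: it handles plain path prefixes only (no wildcards), and the function name is_allowed and the (directive, prefix) tuple layout are assumptions made for the example.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply one group's rules, given as (directive, path_prefix) pairs."""
    best_len = -1
    allowed = True  # rule 3: no matching rule means the URL is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        # Rule 1: the longer match wins. Rule 2: on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
print(is_allowed("/docs/public/guide.html", rules))
print(is_allowed("/docs/internal/notes.html", rules))
print(is_allowed("/blog/", rules))
```

The result does not depend on the order in which the rules are listed, which is the point of the longest-match rule.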

User-agent matching uses a longest-substring rule: a crawler picks the most specific User-agent group that matches its product token, then applies only that group's rules. If no group matches, the User-agent: * group applies if one exists.
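The group-selection rule described above can be modeled as follows. This is a sketch under the assumption that matching is a case-insensitive substring test against the crawler's product token; select_group is a hypothetical name, and real crawlers may differ in detail.

```python
from typing import Optional

def select_group(product_token: str, group_tokens: list[str]) -> Optional[str]:
    """Pick the User-agent group a crawler with product_token would follow."""
    token = product_token.lower()
    # Candidate groups: every named group whose token appears in the crawler's token.
    matches = [g for g in group_tokens if g != "*" and g.lower() in token]
    if matches:
        return max(matches, key=len)  # most specific (longest) match wins
    # Fall back to the wildcard group if the file has one.
    return "*" if "*" in group_tokens else None

groups = ["*", "Googlebot", "Googlebot-News"]
print(select_group("Googlebot-News", groups))
print(select_group("Googlebot-Image", groups))
print(select_group("ExampleBot", groups))
```

A return value of None means no group applies, so the crawler has no restrictions from this file.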

4. Wildcards, end-anchors, and parameter URLs

RFC 9309 documents two pattern characters that are widely supported:

Asterisk wildcard `*`

Matches any sequence of characters (including the empty string). Useful for matching query-string variants or specific file extensions across paths.

# Block all PDFs from being crawled
User-agent: *
Disallow: /*.pdf$

# Block parameter URLs containing 'ref='
Disallow: /*?ref=

End-anchor `$`

Anchors the pattern to the end of the URL. Combined with *, it lets you match exact file extensions: /*.pdf$ matches /files/report.pdf but not /files/report.pdf?download=1.
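One way to reason about * and $ is to translate a robots.txt pattern into a regular expression. The Python sketch below does that; the helper names pattern_to_regex and matches are hypothetical, and the translation is an illustration rather than any crawler's actual implementation.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape each literal chunk, then join the chunks with ".*" for every "*".
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern: str, path_and_query: str) -> bool:
    # re.match anchors at the start, mirroring prefix matching from "/".
    return pattern_to_regex(pattern).match(path_and_query) is not None

print(matches("/*.pdf$", "/files/report.pdf"))
print(matches("/*.pdf$", "/files/report.pdf?download=1"))
print(matches("/*?ref=", "/products/widget?ref=newsletter"))
```

Without the trailing $, a pattern behaves as a prefix, so /*.pdf would also match the URL with the query string.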

Parameter URLs

Common practice for sites with many parameter variants: Disallow the parameter pattern, Allow the clean canonical path. For full canonical-URL guidance see our Canonical URLs — Complete Guide.

5. AI crawler control

Robots.txt is the canonical control plane for AI crawler access. Each major AI provider publishes a documented User-agent that you can name in a User-agent block and opt out of via Disallow: /. For the per-crawler reference (user-agent strings, documented purpose, provider documentation), see the AI Crawlers — Complete Reference.

The three common stances:

Allow all

User-agent: *
Allow: /

Selective block — training out, retrieval in

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

AI-only block

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Choose deliberately. Blocking AI crawlers reduces the chance your content is used for training; it may also reduce the chance your content is cited or referenced by AI products that may send traffic.
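If the blocked list changes often, generating the file from a single list keeps the groups consistent. The following Python sketch builds the "named crawlers blocked, everyone else allowed" layout shown above; build_robots_txt is a hypothetical helper, and the agent names passed in should come from each provider's own documentation.

```python
from typing import Optional

def build_robots_txt(blocked_agents: list[str], sitemap_url: Optional[str] = None) -> str:
    """Emit one Disallow-all group per blocked agent, then an allow-all default."""
    lines: list[str] = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Default group: every crawler not named above stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"

print(build_robots_txt(["GPTBot", "Google-Extended", "CCBot"],
                       "https://example.com/sitemap.xml"))
```

Keeping the list in source control alongside the generator makes each policy change reviewable.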

6. Worked example — helperg.com's robots.txt

The current helperg.com /robots.txt demonstrates the allow-all-with-explicit-AI-naming pattern. It welcomes every named AI crawler explicitly rather than relying on the default. Excerpt:

# HELPERG LLC — robots.txt
# https://helperg.com

# All crawlers (default)
User-agent: *
Allow: /
Disallow: /admin/

# AI crawlers — explicitly welcome
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Sitemap
Sitemap: https://helperg.com/sitemap.xml

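The deliberate choice to name each AI crawler with an explicit Allow: / communicates intent to readers and to crawler operators reviewing logs. The default would already allow them; naming makes the policy auditable. One side effect to be aware of: a crawler that matches a named group follows only that group, so the Disallow: /admin/ in the User-agent: * group does not apply to the named AI crawlers unless it is repeated in their groups.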

7. Testing and validation

Local syntax check

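Open the file in a browser at your host's root. Confirm it returns HTTP 200 with Content-Type: text/plain. A 404 is treated by Google as if no robots.txt exists, so everything may be crawled. A 5xx is different: RFC 9309 tells crawlers to assume a complete disallow while the file is unreachable, and Google documents the same temporary behavior. If the error persists for a long period (RFC 9309 gives 30 days as an example), a crawler may fall back to a cached copy or treat the file as unavailable and crawl without restrictions.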
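The same status and Content-Type check can be scripted. Below is a minimal Python sketch using only the standard library; check_response and check_robots are hypothetical names, and check_robots needs network access, so it is shown but not called.

```python
import urllib.error
import urllib.request

def check_response(status: int, content_type: str) -> list[str]:
    """Return a list of problems; an empty list means the response looks healthy."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Content-Type may carry parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"expected text/plain, got {media_type or 'no Content-Type'}")
    return problems

def check_robots(url: str) -> list[str]:
    """Fetch url and report problems with its status or Content-Type."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return check_response(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx and 5xx; the error object still carries the headers.
        return check_response(err.code, err.headers.get("Content-Type", ""))

print(check_response(200, "text/html; charset=utf-8"))
```

For a live check, call check_robots("https://example.com/robots.txt") with your own host in place of example.com.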

Path-match testing

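Test specific URLs against your rules, checking which should resolve as Allow and which as Disallow. Options include helperg.com's robots.txt validator and Google Search Console (the robots.txt report and the URL Inspection tool; Google retired its standalone robots.txt Tester in 2023).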
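For a quick local check before reaching for a hosted validator, Python's standard-library urllib.robotparser can evaluate URLs against a rules file. This is a sketch with a hypothetical helper name (check_paths) and a hypothetical crawler name (ExampleBot). One caveat: that parser's matching is simpler than RFC 9309's (it applies rules in file order and does not interpret * or $ in paths), so treat it as a smoke test for simple prefix rules only.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

def check_paths(robots_txt: str, agent: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to True (fetch allowed) or False (disallowed) for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}

results = check_paths(ROBOTS_TXT, "ExampleBot", [
    "https://example.com/admin/login",
    "https://example.com/blog/post",
])
for url, ok in results.items():
    print("Allow   " if ok else "Disallow", url)
```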

Production confirmation

After deploying a change, review server logs to confirm documented crawlers are fetching /robots.txt and receiving 200, and that their behavior on disallowed paths matches expectations.
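A log review like this can be partly automated. The Python sketch below counts /robots.txt fetches per crawler and status code. It assumes the common "combined" access-log format; the sample lines and user-agent strings are made up for illustration, and robots_fetches is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format access log line (an assumed format; adjust to your server).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_fetches(log_lines, crawler_tokens):
    """Count /robots.txt fetches per (crawler token, HTTP status)."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m is None or m["path"].split("?")[0] != "/robots.txt":
            continue
        agent = m["agent"].lower()
        for token in crawler_tokens:
            if token.lower() in agent:
                counts[(token, int(m["status"]))] += 1
    return counts

# Illustrative lines only; real user-agent strings differ.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:05:00 +0000] "GET /robots.txt HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [01/Jun/2026:10:06:00 +0000] "GET /blog/post HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
fetch_counts = robots_fetches(SAMPLE_LOG, ["GPTBot", "ClaudeBot"])
for (token, status), n in sorted(fetch_counts.items()):
    print(token, status, n)
```

Any non-200 status in the output is worth investigating before relying on the new rules.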

8. Common mistakes

Treating Disallow as a noindex. Disallow prevents crawling; it does not prevent indexing of URLs discovered through links. To exclude from the index, use a noindex meta tag and allow crawling so the tag can be read.
Blocking CSS or JS that the page needs to render. If Google cannot render the page, its layout and content assessment suffers. Allow CSS and JS unless there is a specific reason to block.
Disallowing a security-sensitive path. Listing /admin/ in robots.txt advertises its existence. Use authentication or server-side IP allowlisting for true access control; do not put security-sensitive paths in a publicly-readable robots.txt.
Conflicting rules in the same group. RFC 9309's precedence (longest match wins, Allow wins ties) resolves an overlapping Allow and Disallow, but confirm the outcome in a validator rather than assuming it.
Returning HTML for robots.txt. Some misconfigured servers return the homepage HTML for any missing path including /robots.txt. Confirm Content-Type is text/plain.
Forgetting the Sitemap directive. Listing the sitemap in robots.txt is the canonical no-account way to advertise it. Most major crawlers honor it.
Assuming robots.txt is honored by all crawlers. It is an advisory standard. Malicious scrapers commonly ignore it. Use rate limiting or WAF rules for protection.
Forgetting to version the file. Robots.txt changes are policy changes; track them in source control alongside the rest of the site.

9. What robots.txt is NOT

Robots.txt is not a security boundary. It is an advisory standard. A scraper or a hostile crawler can ignore it. Do not place security-sensitive paths in robots.txt — that advertises them. Use authentication, authorization, and server-side controls for content that must be protected.

Not a noindex mechanism. Disallow blocks crawl; it does not remove URLs from search indexes.
Not a way to hide content from humans. Anyone can fetch /robots.txt and see the listed paths.
Not a legal contract. Its legal weight varies by jurisdiction, so do not rely on it as an enforcement mechanism.
Not a rate-limit. Use server-side controls if you need to enforce request rates.
Not retroactive. Adding a Disallow today does not remove content from indexes that have already crawled it.

10. Checklist

robots.txt exists at the host root.
robots.txt returns HTTP 200 with Content-Type: text/plain.
robots.txt is UTF-8 encoded.
The file uses LF line endings (CRLF works for most crawlers but LF is safer).
Every User-agent group is correctly grouped (User-agent line(s) followed by Allow/Disallow rules).
A Sitemap directive points to your sitemap.xml using an absolute URL.
The Sitemap URL matches your canonical scheme (HTTPS).
The default User-agent: * group declares your default policy.
Each AI crawler you have a position on is named explicitly (Allow or Disallow).
CSS and JS that the site needs to render are not blocked.
Security-sensitive paths are not listed in robots.txt.
The file is no larger than 500 KiB (Google's stated limit).
Comments start with # and run to the end of the line.
Wildcards (*) and end-anchors ($) are used where appropriate.
Crawl-delay, if used, is documented as a soft preference (Google ignores it).
Conflicting Allow/Disallow rules have been tested in a validator.
robots.txt is version-controlled with the rest of the site.
A change-history is reviewable from the version control log.
An owner is identified for re-reviewing the access policy at a documented cadence.
The file has been validated with at least one validator (helperg.com tool, Google Search Console, or a third-party).

11. FAQ

Is robots.txt legally binding?

No. RFC 9309 defines robots.txt as an advisory standard. It is widely respected by well-behaved crawlers from major providers, but it does not bind crawlers that ignore the protocol and it is not a legal contract. Treat robots.txt as a strong convention, not a security boundary.

Where should robots.txt live?

At the site root: https://example.com/robots.txt. Per RFC 9309 and Google's documentation, robots.txt is fetched at the root of the host. It is not picked up from subdirectories.

Does Crawl-delay work?

Crawl-delay is a de-facto extension not specified in RFC 9309. Some crawlers honor it; many do not. Google has documented that it ignores Crawl-delay in robots.txt; to slow Googlebot, use server-side controls instead.

Can I use robots.txt to hide pages from search?

No, not reliably. A Disallow rule prevents crawling but does not prevent indexing — Google may still index a URL it discovers through links even if the page itself is disallowed. To exclude a page from indexing, use a noindex meta tag on the page itself and let crawlers reach the page to read the tag, or use HTTP authentication for true access control.

What is the precedence between Allow and Disallow?

RFC 9309 specifies that the most specific rule applies: when a path matches both an Allow and a Disallow, the rule with the longer matching path wins. If lengths are equal, the less restrictive rule (Allow) wins. Google documents the same precedence in its robots.txt documentation.

Do wildcards and end-anchors work?

Yes. RFC 9309 documents wildcard * (matches any sequence of characters) and end-anchor $ (matches end of URL). Both are commonly used to block query-string variants or specific file extensions.

How do I block AI crawlers?

Add explicit User-agent blocks naming each AI crawler you want to block (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, etc.), each with Disallow: /. The full per-crawler reference is in our AI Crawlers — Complete Reference page. Make sure the documentation you cite is from the provider itself, not community-circulated lists.

Should I list my sitemap in robots.txt?

Yes. The Sitemap directive in robots.txt is the canonical, no-account-needed way to advertise your sitemap. It comes from the sitemaps.org protocol, and RFC 9309 mentions it as an optional record that crawlers may interpret. Use the absolute URL: Sitemap: https://example.com/sitemap.xml.

12. Sources

RFC 9309 — Robots Exclusion Protocol — captured 2026-06
Google Search Central — Introduction to robots.txt — captured 2026-06
Google Search Central — Create and submit a robots.txt file — captured 2026-06
Google Search Central — Sitemap overview (for the Sitemap directive) — captured 2026-06
OpenAI — Bots overview (per-crawler reference) — captured 2026-06
Anthropic — Crawler documentation — captured 2026-06
Perplexity — Bots and crawlers — captured 2026-06
