sitemap.xml — Complete Guide
Last updated: June 1, 2026
What this guide is and is not. Protocol-anchored reference for sitemap.xml: the sitemaps.org protocol, required and optional fields, lastmod accuracy, sitemap index files, image/video extensions, common errors, and validation. Worked example uses helperg.com's live /sitemap.xml. For a task-focused workflow, see How to Optimize sitemap.xml.
1. Sitemap basics
The Sitemap protocol is specified at sitemaps.org. Google, Bing, and other major search engines consume the format. A sitemap is a machine-readable inventory of URLs you want crawlers to discover.
Per the protocol, a single sitemap file can contain at most 50,000 URLs and must be no larger than 50 MB uncompressed; both limits apply at once. Sites that exceed either limit use a sitemap index file that lists multiple individual sitemaps; the per-sitemap limits still apply to each child file.
The file lives at the site root by convention (/sitemap.xml), but the actual location is advertised in robots.txt via the Sitemap: directive. Some crawlers may also probe the default location, but the protocol does not require them to, so do not rely on it.
2. Fields — required and optional
Per the sitemaps.org protocol, each <url> entry supports four fields:
| Field | Required? | Purpose |
|---|---|---|
| <loc> | Required | The URL. Must be absolute, must use the canonical scheme (HTTPS where applicable), and must be URL-encoded if it contains special characters. |
| <lastmod> | Optional | Last modification date. Use accurate values; Google documents that it uses lastmod when the values are accurate and may discount it for the entire site when it detects systematic falsification. |
| <changefreq> | Optional | Suggested change frequency. Google has documented that it ignores this field. Other crawlers vary. |
| <priority> | Optional | Relative priority (0.0 to 1.0). Google has documented that it ignores this field. Most major crawlers also ignore it. |
Practical recommendation: include <loc> and an accurate <lastmod>. Skip <priority> and <changefreq> or include them but do not rely on them.
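As an illustration of that recommendation, here is a minimal sketch that builds a urlset carrying only <loc> and an optional <lastmod>, using Python's standard xml.etree.ElementTree. The function name build_urlset and the example.com URLs are assumptions made for this example; the serializer takes care of XML-escaping ampersands in query strings.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_urlset(entries):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None.

    build_urlset is a hypothetical helper name, not part of any library.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Returns bytes that start with an XML declaration.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    xml_bytes = build_urlset([
        ("https://example.com/", "2026-06-01"),
        ("https://example.com/search?a=1&b=2", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

<priority> and <changefreq> are left out on purpose, in line with the recommendation above.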
3. lastmod accuracy
Of the optional fields, <lastmod> is the one Google has explicitly stated it pays attention to. The corollary, also documented by Google: if Google detects systematic falsification of lastmod values (every URL bumped on every deployment regardless of content change), Google may discount lastmod for the entire site.
The honest pattern is to bump lastmod when the content actually changes meaningfully — new sections, substantive edits, updated facts. Do not bump it for whitespace changes, deployment cycles, or unchanged content. The signal's value depends on its accuracy.
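One possible way to enforce this in a build script is to tie lastmod to a fingerprint of the page content, so the date moves only when the content does. The sketch below is an assumption-laden illustration, not something the protocol or Google prescribes: the helper names, the whitespace normalization, and the idea of storing the previous fingerprint alongside the previous lastmod are all choices made for this example.

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash the content after collapsing whitespace, so that
    whitespace-only edits do not change the fingerprint."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def next_lastmod(prev_fingerprint, prev_lastmod, text, today):
    """Return (fingerprint, lastmod) for the current build.

    lastmod is carried over unchanged unless the fingerprint differs
    from the one recorded at the previous build.
    """
    fingerprint = content_fingerprint(text)
    if fingerprint == prev_fingerprint:
        return fingerprint, prev_lastmod
    return fingerprint, today

if __name__ == "__main__":
    old = content_fingerprint("Sitemap guide.  Version one.\n")
    # Whitespace-only change: lastmod stays at the earlier date.
    print(next_lastmod(old, "2026-05-19", "Sitemap guide. Version one.", "2026-06-01")[1])
    # Substantive change: lastmod moves to today's date.
    print(next_lastmod(old, "2026-05-19", "Sitemap guide. Version two.", "2026-06-01")[1])
```

A hash cannot tell a typo fix from a substantive edit, so a site that wants to ignore trivial edits would still need an editorial override on top of this.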
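Format: W3C Datetime, a profile of ISO 8601. A date alone is valid (2026-06-01). A date-time is also valid for time-of-day precision, but it must carry a time zone designator (2026-06-01T12:00:00Z or 2026-06-01T12:00:00+00:00).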
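A format check along these lines can catch malformed lastmod values before they ship. This is a rough sketch that covers only the full-date and date-time shapes shown above (the W3C Datetime profile also allows year-only and year-month values, which are deliberately not handled here); is_valid_lastmod is a hypothetical helper name.

```python
import re
from datetime import date

# Full date, optionally followed by a time with a required zone designator.
_LASTMOD = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"                      # YYYY-MM-DD
    r"(?:T(\d{2}):(\d{2})(?::(\d{2})(?:\.\d+)?)?"   # Thh:mm[:ss[.fraction]]
    r"(?:Z|[+-]\d{2}:\d{2}))?"                      # Z or +hh:mm / -hh:mm
)

def is_valid_lastmod(value):
    """Return True if value looks like a date or zoned date-time."""
    match = _LASTMOD.fullmatch(value)
    if not match:
        return False
    year, month, day = (int(match.group(i)) for i in (1, 2, 3))
    try:
        date(year, month, day)  # rejects impossible dates such as 2026-02-30
    except ValueError:
        return False
    if match.group(4) is not None:
        hour, minute = int(match.group(4)), int(match.group(5))
        second = int(match.group(6)) if match.group(6) else 0
        if hour > 23 or minute > 59 or second > 59:
            return False
    return True

if __name__ == "__main__":
    for sample in ("2026-06-01", "2026-06-01T12:00:00Z", "2026-06-01T12:00:00", "06/01/2026"):
        print(sample, is_valid_lastmod(sample))
```

The check validates shape only; it says nothing about whether the date reflects a real content change.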
4. Image and video extensions
The sitemaps.org protocol allows extensions. Google publishes documented extensions for images and videos.
Image sitemap extension
Adds <image:image> elements under a <url> entry. Useful when image content is a primary surface (gallery sites, e-commerce product images, news photography).
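To make the shape concrete, the sketch below emits <image:image> entries under each <url>. It assumes the namespace URI and the <image:loc> child element as published in Google's image sitemap documentation; check both against the current documentation before relying on them. build_image_urlset and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed from Google's image sitemap documentation.
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

def build_image_urlset(pages):
    """pages maps a page URL to the list of image URLs shown on that page."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # One <image:image> wrapper per image, each with its own <image:loc>.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")

if __name__ == "__main__":
    print(build_image_urlset({
        "https://example.com/gallery/": [
            "https://example.com/img/a.jpg",
            "https://example.com/img/b.jpg",
        ],
    }))
```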
Video sitemap extension
Adds <video:video> elements with title, description, thumbnail, and play URL. Useful for video-first sites.
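A single video entry might be assembled as in the sketch below. The namespace URI and the element names (thumbnail_loc, title, description, player_loc) are taken from Google's video sitemap documentation as best understood here and should be verified against the current version; build_video_url and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed from Google's video sitemap documentation.
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

def build_video_url(page_url, video):
    """Return one <url> element with a nested <video:video> entry.

    video is a dict with the keys thumbnail_loc, title, description
    and player_loc.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    entry = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    for field in ("thumbnail_loc", "title", "description", "player_loc"):
        ET.SubElement(entry, f"{{{VIDEO_NS}}}{field}").text = video[field]
    return url

if __name__ == "__main__":
    element = build_video_url("https://example.com/videos/demo/", {
        "thumbnail_loc": "https://example.com/thumbs/demo.jpg",
        "title": "Demo",
        "description": "A short product demo.",
        "player_loc": "https://example.com/player?video=demo",
    })
    print(ET.tostring(element, encoding="unicode"))
```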
For text-first content sites, the base <url> entries are sufficient. The extensions add work; they are worth it only when image or video content is a primary surface of your site.
5. Sitemap index files
When a site exceeds 50,000 URLs or 50 MB per sitemap, split the URLs across multiple sitemaps and list them in a sitemap index file.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2026-06-01</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-articles.xml</loc>
<lastmod>2026-06-01</lastmod>
</sitemap>
</sitemapindex>
The index file itself is subject to the same limits: at most 50,000 <sitemap> entries and 50 MB uncompressed. Do not nest indexes. An index lists sitemap files, not other index files, and Google Search Console reports an index that points at another index as an error. A site that outgrows one index should publish several index files instead.
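The splitting step can be sketched as follows: chunk the URL list at the 50,000-entry limit, then write one index entry per child file. This sketch checks only the URL count, not the 50 MB size limit, and the helper names and the sitemap-N.xml naming scheme are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000

def chunk_urls(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `limit`."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

def build_index(child_sitemaps):
    """Build a sitemap index from (loc, lastmod) pairs, one per child file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc, lastmod in child_sitemaps:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode")

if __name__ == "__main__":
    urls = [f"https://example.com/p/{i}" for i in range(120_001)]
    chunks = chunk_urls(urls)
    children = [
        (f"https://example.com/sitemap-{n}.xml", "2026-06-01")
        for n in range(1, len(chunks) + 1)
    ]
    print([len(chunk) for chunk in chunks])
    print(build_index(children))
```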
6. Discovery and submission
Crawlers discover the sitemap two ways:
- robots.txt Sitemap directive. The canonical no-account method. Search engines that support the directive pick up the sitemap from it. Use an absolute URL, for example Sitemap: https://example.com/sitemap.xml
- Direct submission via webmaster tools. Google Search Console, Bing Webmaster Tools. Submission provides faster feedback (coverage reports) but is not required for crawl discovery.
If you maintain Bing Webmaster Tools, also submit there. Bing's sitemap documentation describes their consumption pattern.
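To verify the robots.txt side, a small check like the one below can pull out every Sitemap: line and confirm each value is an absolute URL. It is a simplified sketch: the parsing rules (case-insensitive key, # starts a comment) are assumptions about common robots.txt handling, and both helper names are hypothetical.

```python
from urllib.parse import urlparse

def sitemap_urls_from_robots(robots_txt):
    """Return the values of all Sitemap: lines in a robots.txt body."""
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")     # split on the first colon only
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found

def is_absolute_url(url):
    """True for http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
    for url in sitemap_urls_from_robots(robots):
        print(url, is_absolute_url(url))
```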
7. Worked example — helperg.com's sitemap.xml
The current helperg.com /sitemap.xml demonstrates a single-file sitemap with about 100 URL entries. The file is hand-maintained for editorial control over what gets indexed. Excerpt:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://helperg.com/</loc>
<lastmod>2026-05-19</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://helperg.com/ai-search/</loc>
<lastmod>2026-06-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
...
</urlset>
Note: the file includes <changefreq> and <priority> for backward compatibility, but Google ignores both. The <lastmod> values are accurate — they reflect when the URL's content actually changed.
8. Common errors
- Invalid XML. Unescaped characters in URLs (especially &) are the most common cause. URLs must be URL-encoded; ampersands in query strings must be escaped as &amp; in XML.
- Non-canonical URLs. Sitemap URLs should match the canonical URLs declared on each page. Mixing http/https or www/non-www causes Google to flag the discrepancy.
- Broken URLs. URLs that return 404 or 500. Validate each URL returns 200 before adding to sitemap.
- Missing HTTPS consistency. If the site serves HTTPS, every sitemap URL should use HTTPS.
- Stale lastmod. Leaving lastmod unchanged after a substantive content edit hides the update from crawlers that rely on the field.
- Falsified lastmod. Every URL bumped on every deployment regardless of content change is detectable and treated as low-quality signaling.
- Sitemap exceeding size limits. 50,000 URLs or 50 MB uncompressed. Split into multiple sitemaps via a sitemap index.
- Sitemap not advertised in robots.txt. Without the Sitemap directive, discovery depends on crawler defaults and direct submission.
- Including non-indexable URLs. URLs with noindex meta tags, URLs blocked by robots.txt, or login-walled pages do not belong in the sitemap.
- Wrong protocol on sitemap location. The sitemap URL declared in robots.txt must use HTTPS if the site serves HTTPS.
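Several of the errors above (mixed http/https, www versus non-www) come down to sitemap URLs not sharing one origin. A rough check, assuming the site has a single canonical origin, could look like this; off_origin_urls is a hypothetical helper and the comparison is deliberately limited to scheme and host.

```python
from urllib.parse import urlparse

def off_origin_urls(urls, canonical_origin):
    """Return the URLs whose scheme or host differs from the canonical origin."""
    wanted = urlparse(canonical_origin)
    wanted_pair = (wanted.scheme, wanted.netloc.lower())
    mismatched = []
    for url in urls:
        parsed = urlparse(url)
        if (parsed.scheme, parsed.netloc.lower()) != wanted_pair:
            mismatched.append(url)
    return mismatched

if __name__ == "__main__":
    sitemap_urls = [
        "https://example.com/a",
        "http://example.com/b",        # wrong scheme
        "https://www.example.com/c",   # wrong host
    ]
    for url in off_origin_urls(sitemap_urls, "https://example.com"):
        print("off-origin:", url)
```

This does not replace a real canonical check, which has to compare each URL against the rel="canonical" value declared on the page itself.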
9. Validation workflow
Step 1 — XML parse
Confirm the file is valid XML. Any XML parser will do. helperg.com's sitemap validator covers this.
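If you prefer to run the parse locally, Python's standard library is enough. The sketch below raises on malformed XML and returns the <loc> values otherwise; parse_sitemap is a hypothetical helper name, and the check covers well-formedness and the root element only, not full schema validation.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_bytes):
    """Return the <loc> values from a sitemap or sitemap index.

    Raises ET.ParseError if the input is not well-formed XML, and
    ValueError if the root element is not urlset or sitemapindex.
    """
    root = ET.fromstring(xml_bytes)
    if root.tag not in (SITEMAP_NS + "urlset", SITEMAP_NS + "sitemapindex"):
        raise ValueError(f"unexpected root element: {root.tag}")
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]

if __name__ == "__main__":
    with open("sitemap.xml", "rb") as handle:
        try:
            locs = parse_sitemap(handle.read())
            print(f"OK: {len(locs)} <loc> entries")
        except ET.ParseError as error:
            # The message includes the line and column of the first problem.
            print(f"Invalid XML: {error}")
```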
Step 2 — URL coverage check
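Confirm every URL in the sitemap returns HTTP 200 and matches its canonical. The loop below covers the status-code half; canonical matching needs a separate look at each page's rel="canonical":
# Read only <loc> values, so the xmlns namespace URL is not treated as a page.
grep -oE '<loc>[^<]+' sitemap.xml \
  | sed -e 's/<loc>//' -e 's/&amp;/\&/g' \
  | sort -u \
  | while IFS= read -r url; do
      # No -L: a redirect shows up as 301/302 instead of being followed to a 200.
      code=$(curl -s -o /dev/null -I --max-time 8 -w '%{http_code}' "$url")
      printf "%-4s %s\n" "$code" "$url"
    done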
Step 3 — Search Console coverage
Submit the sitemap in Google Search Console. The coverage report flags URLs that are submitted-but-not-indexed, indexed-not-in-sitemap, and any error categories. Use this for ongoing monitoring.
10. Checklist
- sitemap.xml exists and is reachable at the documented location.
- The file is valid XML (parses without error).
- The file uses the sitemaps.org schema (http://www.sitemaps.org/schemas/sitemap/0.9).
- Every <loc> is an absolute URL using the canonical scheme.
- HTTPS is used consistently.
- Every URL in sitemap returns HTTP 200.
- Every URL in sitemap matches its page's canonical.
- <lastmod> values reflect actual content changes.
- URLs with noindex or robots.txt Disallow are not included.
- The sitemap is advertised in robots.txt via the Sitemap: directive.
- The sitemap directive in robots.txt uses an absolute URL.
- The file does not exceed 50,000 URLs or 50 MB uncompressed.
- If the site is larger, a sitemap index file lists the child sitemaps.
- The file is version-controlled with the rest of the site.
- The sitemap is submitted in Google Search Console (and Bing Webmaster Tools, if maintained).
- Search Console coverage reports are reviewed on a documented cadence.
- If image or video content is a primary surface, the relevant extension is in use.
- The sitemap generation process is documented (manual or scripted).
- An owner is identified for sitemap accuracy.
- Ampersands in URLs are escaped as &amp;.
11. FAQ
Where should sitemap.xml live?
At the site root, served as /sitemap.xml. The location is also advertised in robots.txt via the Sitemap: directive. The file lives at the host root because, under the protocol, a sitemap may only list URLs at or below its own directory, so a root-level file is the one location that can cover the whole host.
Do priority and changefreq matter?
Largely no, in 2026. Google has documented that it ignores priority and changefreq when deciding crawl frequency and ranking. The values do not harm anything but they should not be relied on as ranking signals. lastmod accuracy, by contrast, is documented as actively used.
Does lastmod accuracy matter?
Yes. Google has publicly stated that it pays attention to lastmod when accurate, and that systematically falsified lastmod values cause Google to discount the field for the entire site. Update lastmod when the content actually changes; do not bump it on every deployment if content did not change.
What is a sitemap index?
A sitemap index file lists multiple individual sitemap.xml files. Used by larger sites that exceed the per-sitemap limits (50,000 URLs or 50 MB uncompressed). The index is itself an XML file with a sitemapindex root and one sitemap entry per child sitemap.
Should I submit my sitemap to Google?
Submission via Google Search Console is optional. The canonical no-account way to advertise the sitemap is the Sitemap: directive in robots.txt. Submission provides faster Search-Console-visible feedback but is not required for crawl discovery.
Is XML the only sitemap format?
The sitemaps.org protocol also defines plain-text and RSS/Atom formats. XML is the most flexible and most widely used; plain text is acceptable for small sites that only want a URL list. Major search engines accept all three.
Do I need image or video sitemap extensions?
Optional. Google publishes extensions for image and video sitemaps. They are useful when image or video content is a primary surface of your site (gallery sites, video sites). For text-first sites, the base sitemap.xml is sufficient.
What happens if my sitemap has a broken URL?
Google reports it in Search Console as a coverage error. Broken URLs in sitemap reduce trust in the overall sitemap and may slow crawl of legitimate URLs. Validate URLs return 200 before declaring them in sitemap.
12. Sources
- sitemaps.org — The Sitemap protocol — captured 2026-06
- Google Search Central — Sitemap overview — captured 2026-06
- Google Search Central — Build and submit a sitemap — captured 2026-06
- Google Search Central — Robots intro (Sitemap directive) — captured 2026-06
- Google Search Central — Large site crawl budget — captured 2026-06
- Bing Webmaster — Sitemaps — captured 2026-06
- RFC 9309 — Robots Exclusion Protocol (for the Sitemap directive) — captured 2026-06