sitemap.xml — Complete Guide

Last updated: June 1, 2026

What this guide is and is not. Protocol-anchored reference for sitemap.xml: the sitemaps.org protocol, required and optional fields, lastmod accuracy, sitemap index files, image/video extensions, common errors, and validation. Worked example uses helperg.com's live /sitemap.xml. For a task-focused workflow, see How to Optimize sitemap.xml.

1. Sitemap basics

The Sitemap protocol is specified at sitemaps.org. Google, Bing, and other major search engines consume the format. A sitemap is a machine-readable inventory of URLs you want crawlers to discover.

Per the protocol, a single sitemap file can contain no more than 50,000 URLs and must be no larger than 50 MB (52,428,800 bytes) uncompressed; both limits apply at once. Larger sites use a sitemap index file that lists multiple individual sitemaps; the per-sitemap limits still apply to each child file.
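As a rough illustration of the limits above, the following sketch splits a URL list into child-sitemap-sized chunks and checks a serialized file against the byte limit. The function names and the example.com URLs are hypothetical, chosen only for this example.

```python
from typing import Iterator, List

MAX_URLS = 50_000               # per-file URL limit from the protocol
MAX_BYTES = 50 * 1024 * 1024    # 52,428,800 bytes, measured uncompressed


def chunk_urls(urls: List[str], size: int = MAX_URLS) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs, one slice per child sitemap."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def within_size_limit(xml_bytes: bytes) -> bool:
    """True when a serialized sitemap fits the uncompressed byte limit."""
    return len(xml_bytes) <= MAX_BYTES


# Hypothetical site with 120,001 URLs.
urls = [f"https://example.com/p/{i}" for i in range(120_001)]
chunks = list(chunk_urls(urls))
print([len(c) for c in chunks])  # [50000, 50000, 20001]
```

Chunking by URL count alone is not enough for sites with very long URLs or heavy extension markup, which is why the byte check is kept as a separate step.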

The file lives at the site root by convention (/sitemap.xml), but the actual location is advertised in robots.txt via the Sitemap: directive. Some crawlers may also probe the default location, but the protocol does not require them to, so do not rely on it.
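To see what a crawler would read from robots.txt, a minimal sketch using Python's standard-library robots.txt parser (the site_maps() method requires Python 3.8 or later). The robots.txt content here is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real check would fetch it from the site.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the declared sitemap URLs in order,
# or None when the file has no Sitemap: line at all.
sitemaps = parser.site_maps()
print(sitemaps)
```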

2. Fields — required and optional

Per the sitemaps.org protocol, each <url> entry supports four fields:

Field | Required? | Purpose
<loc> | Required | The URL. Must be absolute, must use the canonical scheme (HTTPS where applicable), must be URL-encoded if it contains special characters.
<lastmod> | Optional | Last modification date. Use accurate values; Google publishes that it uses lastmod when accurate and discounts it for the entire site when it detects systematic falsification.
<changefreq> | Optional | Suggested change frequency. Google has documented that it ignores this field. Other crawlers vary.
<priority> | Optional | Relative priority (0.0 to 1.0). Google has documented that it ignores this field. Most major crawlers also ignore it.

Practical recommendation: include <loc> and an accurate <lastmod>. Skip <priority> and <changefreq> or include them but do not rely on them.
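One way to follow that recommendation is to generate the file with only <loc> and, where known, <lastmod>. The sketch below uses Python's standard-library ElementTree; build_urlset and the example URLs are illustrative names, not part of any tool mentioned in this guide.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Serialize (loc, lastmod) pairs to sitemap XML bytes; lastmod may be None."""
    ET.register_namespace("", NS)  # emit the sitemap namespace as the default xmlns
    urlset = ET.Element(f"{{{NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{NS}}}url")
        ET.SubElement(url, f"{{{NS}}}loc").text = loc
        if lastmod:  # omit the element entirely rather than guess a date
            ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_urlset([
    ("https://example.com/", "2026-06-01"),
    ("https://example.com/search?a=1&b=2", None),  # '&' is escaped by the serializer
])
print(xml_bytes.decode("utf-8"))
```

Letting an XML library serialize the values means entity escaping is handled for you; percent-encoding of the URL itself is still the caller's job.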

3. lastmod accuracy

Of the optional fields, <lastmod> is the one Google has explicitly stated it pays attention to. The corollary, also documented by Google: if Google detects systematic falsification of lastmod values (every URL bumped on every deployment regardless of content change), Google may discount lastmod for the entire site.

The honest pattern is to bump lastmod when the content actually changes meaningfully — new sections, substantive edits, updated facts. Do not bump it for whitespace changes, deployment cycles, or unchanged content. The signal's value depends on its accuracy.
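A build pipeline can approximate this rule by storing a fingerprint of each page's content and moving lastmod only when the fingerprint changes. The sketch below is one possible approach under that assumption; the function names are hypothetical, and a hash catches any non-whitespace change, so deciding what counts as "meaningful" still needs editorial judgment.

```python
import hashlib
from datetime import date


def content_fingerprint(text: str) -> str:
    """Hash the content with whitespace runs collapsed, so reformatting alone does not register."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash: str, stored_lastmod: str, new_text: str, today: date):
    """Return (hash, lastmod); lastmod moves to `today` only when the fingerprint changes."""
    new_hash = content_fingerprint(new_text)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod
    return new_hash, today.isoformat()


old_hash = content_fingerprint("Sitemaps list  URLs.")
deploy_day = date(2026, 6, 1)

# Whitespace-only change: lastmod stays at the stored value.
print(next_lastmod(old_hash, "2026-05-19", "Sitemaps list URLs.\n", deploy_day)[1])  # 2026-05-19
# Substantive change: lastmod moves to the deployment date.
print(next_lastmod(old_hash, "2026-05-19", "Sitemaps list canonical URLs.", deploy_day)[1])  # 2026-06-01
```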

Format: W3C Datetime, a profile of ISO 8601. A date alone is valid (2026-06-01). For time-of-day precision, include a time zone designator: 2026-06-01T12:00:00Z or 2026-06-01T12:00:00+02:00.
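A format check can catch malformed lastmod values before the file ships. The following sketch encodes the W3C Datetime shapes as a regular expression; it is an approximation written for this example (it checks shape and ranges, not calendar validity such as February 30).

```python
import re

# Year, then optional month, day, and time. A time must carry a zone designator.
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(-(0[1-9]|1[0-2])"
    r"(-(0[1-9]|[12]\d|3[01])"
    r"(T([01]\d|2[0-3]):[0-5]\d(:[0-5]\d(\.\d+)?)?(Z|[+-]([01]\d|2[0-3]):[0-5]\d))?"
    r")?)?"
)


def is_valid_lastmod(value: str) -> bool:
    """True when `value` has one of the W3C Datetime shapes."""
    return W3C_DATETIME.fullmatch(value) is not None


for candidate in ["2026-06-01", "2026-06-01T12:00:00Z", "2026-06-01 12:00:00", "06/01/2026"]:
    print(candidate, is_valid_lastmod(candidate))
```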

4. Image and video extensions

The sitemaps.org protocol allows extensions. Google publishes documented extensions for images and videos.

Image sitemap extension

Adds <image:image> elements under a <url> entry. Useful when image content is a primary surface (gallery sites, e-commerce product images, news photography).

Video sitemap extension

Adds <video:video> elements with title, description, thumbnail, and play URL. Useful for video-first sites.

For text-first content sites, the base <url> entries are sufficient. The extensions add work; they are worth it only when image or video content is a primary surface of your site.
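For sites where the image extension does apply, a <url> entry with nested image elements can be generated as in the sketch below. It uses the namespace URI that Google documents for image sitemaps; the helper name and the example.com URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG = "http://www.google.com/schemas/sitemap-image/1.1"

ET.register_namespace("", SM)        # default namespace for base sitemap elements
ET.register_namespace("image", IMG)  # image: prefix for the extension elements


def url_with_images(page_url, image_urls):
    """Build one <url> element with an <image:image> child per image."""
    url = ET.Element(f"{{{SM}}}url")
    ET.SubElement(url, f"{{{SM}}}loc").text = page_url
    for src in image_urls:
        image = ET.SubElement(url, f"{{{IMG}}}image")
        ET.SubElement(image, f"{{{IMG}}}loc").text = src
    return url


urlset = ET.Element(f"{{{SM}}}urlset")
urlset.append(url_with_images(
    "https://example.com/gallery/",
    ["https://example.com/img/a.jpg", "https://example.com/img/b.jpg"],
))
print(ET.tostring(urlset, encoding="unicode"))
```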

5. Sitemap index files

When a site exceeds 50,000 URLs or 50 MB per sitemap, split the URLs across multiple sitemaps and list them in a sitemap index file.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2026-06-01</lastmod>
  </sitemap>
</sitemapindex>

The index file itself is also subject to the 50,000-entry and 50 MB limits. An index should list sitemap files only, not other index files: the protocol defines each <sitemap> entry as pointing at a sitemap, and nested indexes are not something crawlers are documented to support. A site too large for one index can publish several flat indexes instead.
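An index like the one above can be generated rather than hand-written. A minimal sketch, assuming the child sitemaps and their lastmod dates are already known; build_index is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # the index is subject to the same entry limit


def build_index(children):
    """Serialize (sitemap_url, lastmod) pairs to sitemap index XML bytes."""
    if len(children) > MAX_ENTRIES:
        raise ValueError("a sitemap index may list at most 50,000 sitemaps")
    ET.register_namespace("", SM)
    index = ET.Element(f"{{{SM}}}sitemapindex")
    for loc, lastmod in children:
        entry = ET.SubElement(index, f"{{{SM}}}sitemap")
        ET.SubElement(entry, f"{{{SM}}}loc").text = loc
        ET.SubElement(entry, f"{{{SM}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


index_bytes = build_index([
    ("https://example.com/sitemap-products.xml", "2026-06-01"),
    ("https://example.com/sitemap-articles.xml", "2026-06-01"),
])
print(index_bytes.decode("utf-8"))
```

Here the lastmod on each entry is meant to describe when that child sitemap file last changed, not when any individual page changed.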

6. Discovery and submission

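Crawlers discover the sitemap two ways: through the Sitemap: directive in robots.txt, and through direct submission in a webmaster console such as Google Search Console.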

If you maintain Bing Webmaster Tools, also submit there. Bing's sitemap documentation describes their consumption pattern.

7. Worked example — helperg.com's sitemap.xml

The current helperg.com /sitemap.xml demonstrates a single-file sitemap with about 100 URL entries. The file is hand-maintained for editorial control over what gets indexed. Excerpt:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://helperg.com/</loc>
    <lastmod>2026-05-19</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://helperg.com/ai-search/</loc>
    <lastmod>2026-06-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>

Note: the file includes <changefreq> and <priority> for backward compatibility, but Google ignores both. The <lastmod> values are accurate — they reflect when the URL's content actually changed.

8. Common errors

  1. Invalid XML. Unescaped characters in URLs (especially &) are the most common cause. URLs must be URL-encoded; ampersands in query strings must be escaped as &amp; in XML.
  2. Non-canonical URLs. Sitemap URLs should match the canonical URLs declared on each page. Mixing http/https or www/non-www causes Google to flag the discrepancy.
  3. Broken URLs. URLs that return 404 or 500. Validate that each URL returns 200 before adding it to the sitemap.
  4. Missing HTTPS consistency. If the site serves HTTPS, every sitemap URL should use HTTPS.
  5. Stale lastmod. A lastmod that is not updated when the content changes understates freshness and makes the signal less useful to crawlers.
  6. Falsified lastmod. Every URL bumped on every deployment regardless of content change is detectable and treated as low-quality signaling.
  7. Sitemap exceeding size limits. 50,000 URLs or 50 MB uncompressed. Split into multiple sitemaps via a sitemap index.
  8. Sitemap not advertised in robots.txt. Without the Sitemap directive, discovery depends on crawler defaults and direct submission.
  9. Including non-indexable URLs. URLs with noindex meta tags, URLs blocked by robots.txt, and login-walled pages do not belong in the sitemap.
  10. Wrong protocol on sitemap location. The sitemap URL declared in robots.txt must use HTTPS if the site serves HTTPS.
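Error 1 in the list above involves two separate layers: percent-encoding of the URL itself, then entity escaping for XML. The sketch below shows one way to apply both when writing <loc> text by hand; loc_text is a hypothetical helper, and the choice of characters left unencoded is an assumption that suits already well-formed URLs.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def loc_text(url: str) -> str:
    """Percent-encode unsafe characters, then entity-escape the result for XML."""
    # Keep URL delimiters and existing %XX escapes intact; encode spaces and non-ASCII.
    encoded = quote(url, safe=":/?#[]@!$&'()*+,;=%~-._")
    # escape() handles &, <, > by default; quotes are added for completeness.
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})


print(loc_text("https://example.com/search?q=a b&lang=en"))
# https://example.com/search?q=a%20b&amp;lang=en
print(loc_text("https://example.com/über"))
# https://example.com/%C3%BCber
```

When the file is produced by an XML library instead of string concatenation, the library performs the entity-escaping step itself, and only the percent-encoding remains to be done by hand.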

9. Validation workflow

Step 1 — XML parse

Confirm the file is valid XML. Any XML parser will do. helperg.com's sitemap validator covers this.
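For a scriptable version of this step, a minimal sketch with Python's standard-library parser follows. It checks well-formedness, the root element, and that every entry carries a <loc>; it is not a full schema validation, and check_sitemap is a name invented for this example.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(xml_bytes: bytes):
    """Return (ok, message) for a sitemap or sitemap index read in binary mode."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return False, f"not well-formed: {exc}"
    if root.tag not in (f"{{{SM}}}urlset", f"{{{SM}}}sitemapindex"):
        return False, f"unexpected root element: {root.tag}"
    for entry in root:
        loc = entry.find(f"{{{SM}}}loc")
        if loc is None or not (loc.text or "").strip():
            return False, "entry without <loc>"
    return True, "ok"


good = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>'
# The raw ampersand below is the classic unescaped-character error.
bad = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/?a=1&b=2</loc></url></urlset>'
print(check_sitemap(good))
print(check_sitemap(bad)[0])
```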

Step 2 — URL coverage check

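Confirm every URL in the sitemap returns HTTP 200. The loop below checks status codes only; comparing each page against its declared canonical is a separate check.

# Extract <loc> values only, so the xmlns namespace URL is not treated as a page.
grep -oE '<loc>[^<]+' sitemap.xml \
  | sed -e 's/<loc>//' -e 's/&amp;/\&/g' \
  | sort -u \
  | while IFS= read -r url; do
      # No -L here: a redirect should surface as 301/302 instead of hiding behind a final 200.
      code=$(curl -s -o /dev/null -I --max-time 8 -w '%{http_code}' "$url")
      printf "%-4s  %s\n" "$code" "$url"
    done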
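A status check does not by itself confirm that a page declares the same URL as its canonical. A minimal sketch of that comparison, using Python's standard-library HTML parser on page markup that has already been fetched; the class and function names are hypothetical, and the comparison is a plain string match with no URL normalization.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def canonical_of(html: str):
    """Return the declared canonical URL, or None when the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def matches_canonical(sitemap_url: str, html: str) -> bool:
    """True when the page's declared canonical equals the URL listed in the sitemap."""
    return canonical_of(html) == sitemap_url


page = '<html><head><link rel="canonical" href="https://example.com/guide/"></head></html>'
print(matches_canonical("https://example.com/guide/", page))  # True
print(matches_canonical("http://example.com/guide/", page))   # False
```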

Step 3 — Search Console coverage

Submit the sitemap in Google Search Console. The coverage report flags URLs that are submitted-but-not-indexed, indexed-not-in-sitemap, and any error categories. Use this for ongoing monitoring.

10. Checklist

11. FAQ

Where should sitemap.xml live?

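At the site root, served as /sitemap.xml, with the location also advertised in robots.txt via the Sitemap: directive. The root matters because, per the protocol, a sitemap's location limits which URLs it may list: a file at /catalog/sitemap.xml can cover only URLs under /catalog/, while a root-level file can cover the whole host.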

Do priority and changefreq matter?

Largely no, in 2026. Google has documented that it ignores priority and changefreq when deciding crawl frequency and ranking. The values do not harm anything but they should not be relied on as ranking signals. lastmod accuracy, by contrast, is documented as actively used.

Does lastmod accuracy matter?

Yes. Google has publicly stated that it pays attention to lastmod when accurate, and that systematically falsified lastmod values cause Google to discount the field for the entire site. Update lastmod when the content actually changes; do not bump it on every deployment if content did not change.

What is a sitemap index?

A sitemap index file lists multiple individual sitemap.xml files. Used by larger sites that exceed the per-sitemap limits (50,000 URLs or 50 MB uncompressed). The index is itself an XML file with a sitemapindex root and one sitemap entry per child sitemap.

Should I submit my sitemap to Google?

Submission via Google Search Console is optional. The canonical no-account way to advertise the sitemap is the Sitemap: directive in robots.txt. Submission provides faster Search-Console-visible feedback but is not required for crawl discovery.

Is XML the only sitemap format?

The sitemaps.org protocol also defines plain-text and RSS/Atom formats. XML is the most flexible and most widely used; plain text is acceptable for small sites that only want a URL list. Major search engines accept all three.

Do I need image or video sitemap extensions?

Optional. Google publishes extensions for image and video sitemaps. They are useful when image or video content is a primary surface of your site (gallery sites, video sites). For text-first sites, the base sitemap.xml is sufficient.

What happens if my sitemap has a broken URL?

Google reports it in Search Console as a coverage error. Broken URLs in the sitemap reduce trust in the overall sitemap and may slow crawling of legitimate URLs. Validate that each URL returns 200 before listing it in the sitemap.

12. Sources