Preamble

This document specifies the format of an XML-based sitemap file, as used by the Standard-Sitemap Protocol (SSP). The level of detail offered is aimed at developers of sitemap-aware software, so that implementations can produce consistent behaviour. Consequently, authors experimenting with sitemaps can understand why their sitemap file did or did not produce the desired effects, and can be sure that it will be interpreted correctly in all contexts.

Some (partial) example sitemaps are available.

Semantics

An SSP sitemap expresses several aspects of a site:

Nodes in the hierarchy are referred to as hierarchy nodes. Hierarchy nodes may be either the root node, item nodes or group nodes. Nodes with special meanings are known as role nodes. A node may both have a role and exist in the hierarchy.

A sitemap is represented by one or more SSP sitemap XML files. A URI identifies one of them, the root file, and this may reference other files to complete the representation of the sitemap.

Nodes and their qualities

As part of the hierarchy, each node may contain zero or more other nodes, and each node has at least one parent (except for the anonymous root node, which has no parent).

A node has zero or more roles, a relation and a priority. A node has one or more variants.

Reachability and exclusion of elements

In the node hierarchy, the root node is represented by a <sitemap> element, item nodes are represented by <item> elements, and group nodes are represented by <group> elements. Role nodes are similarly represented by <item> elements. Note that the elements that represent hierarchy nodes themselves form a hierarchy due to the nature of XML, and this hierarchy is not necessarily congruent with the node hierarchy.

With respect to a given root file, elements may be reached. Some elements may be excluded. Only elements that are reached may represent role nodes, and only elements that are reached and not excluded may represent hierarchy nodes.

  • If the root file’s document element is a <sitemap> element, it is reached, and represents the root node.

  • If a <sitemap>, <item> or <group> element is reached and not excluded, its <item> and <group> children are also reached. Hierarchy nodes represented by these reached elements can be children of the node represented by their parent element.

  • If a <sitemap>, <item> or <group> element is reached and not excluded, and contains <external> elements that reference <item> and <group> elements, the referenced elements are also reached. Hierarchy nodes represented by these reached elements can be children of the node represented by the parent of the referencing <external> element.

Despite being reached, an element may be excluded from the node hierarchy—i.e. it represents no node in the hierarchy—under any of the following conditions:

  • The element has a tree attribute with the value exclude.

  • The element has a tree value of auto, and its role set includes neither tree nor any unrecognised role.

  • The element has a tree value of user, and its role set includes neither tree nor any unrecognised role, and the user has specified that such elements should be excluded (usually by a configuration option).

  • The element is an <item> whose represented variants include at least one with no ‘name’ quality.

  • The element is an <item> that has no non-excluded <item> or <group> children and no <external> children referencing non-excluded <item> or <group> elements, and whose represented variants include at least one with no ‘location’ quality.

Note that it is possible to compute an element’s exclusion and set of represented variants without knowing whether it has been reached.

Although such excluded elements have been reached, and indeed can represent role nodes, they do not allow their own descendant elements to be reached. (This does not prevent those descendants from being reached by other means.)

Given that the default value of role is tree, and that the default value of tree is auto, an ordinary <item> or <group> element whose parent is reached will also be reached and not excluded, because the role set includes tree. It therefore will appear as a hierarchy node.
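The three exclusion conditions that depend on the tree attribute can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the protocol: the names excluded_by_tree_attribute, RECOGNISED_ROLES and user_excludes are hypothetical, and the two conditions based on missing ‘name’ or ‘location’ qualities are deliberately left out.

```python
# Roles defined by this specification; any other token counts as unrecognised.
RECOGNISED_ROLES = {"home", "contentinfo", "contact", "search", "searchpage", "tree"}


def excluded_by_tree_attribute(tree="auto", role="tree", user_excludes=False):
    """Apply only the tree-attribute exclusion conditions (hypothetical helper).

    `user_excludes` stands for the user's configuration option that asks for
    role-only elements with tree="user" to be left out of the hierarchy.
    """
    roles = set(role.split())
    # "tree" or any unrecognised role keeps the element in the hierarchy.
    keeps = "tree" in roles or bool(roles - RECOGNISED_ROLES)
    if tree == "exclude":
        return True
    if tree == "include":
        return False
    if tree == "auto":
        return not keeps
    if tree == "user":
        return not keeps and user_excludes
    raise ValueError(f"unknown tree value: {tree!r}")
```

With the defaults (tree="auto", role="tree") the function returns False, which matches the observation above that an ordinary element appears as a hierarchy node.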

Node variants and their qualities

Each node variant can have some of the following qualities:

name
a short text acting as the name or title of the node, as specified by the name attribute
description
a longer text describing the node, possibly distinguishing it from other nodes with similar or identical names, as specified by the description attribute
location
the address (URI) of the variant’s content, as specified by the url attribute
language
the language in which the content at the variant’s location is written, as specified by the lang attribute
character encoding
the octet-to-character translation by which the content at the variant’s location is encoded, as specified by the charset attribute
content type
the format of the content at the variant’s location, as specified by the type attribute
search template
the format of a query that can be issued to the node’s location for searching the site, as specified by the data attribute
search method
the HTTP method of a query that can be issued to the node’s location for searching the site, as specified by the method attribute
referrer
the choice of HTTP referrer URI when fetching a node’s ‘location’, as specified by the refer attribute

An <item> or <group> element (a node element) serves as the root of a local hierarchy, with all other elements being <variant>s. Altogether, these specify all of the corresponding node’s variants.

The node element may specify qualities through the XML attributes listed above. Each contained <variant> inherits the qualities of its parent, and may introduce qualities through its own attributes. Finally, each leaf of the local hierarchy specifies a single variant of its corresponding node, with the qualities accumulated by its ancestry.

<item name="FAQ" url="faq.html">
  <variant description="Frequently asked questions" lang="en" />
  <variant description="Oftaj demandoj" lang="eo" />
</item>

In the example above, an <item> defines the ‘name’ and ‘location’ qualities common to its two variants. Their ‘description’ qualities, however, are language-dependent.
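The accumulation of qualities down the local hierarchy can be sketched in Python with the standard xml.etree.ElementTree module. This is a minimal sketch under assumptions: the helper names variants and local_name are hypothetical, attribute values are kept as written (the url is not resolved against a base URI), and the inheritance restrictions described under the <variant> element are ignored.

```python
import xml.etree.ElementTree as ET

# Attributes that set variant qualities, as listed above.
QUALITY_ATTRS = ("name", "description", "url", "lang", "charset",
                 "type", "data", "method", "refer")


def local_name(elem):
    """Strip any '{namespace}' part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def variants(node_elem):
    """Return one dict of attribute values per leaf of the local hierarchy."""
    def walk(elem, inherited):
        qualities = dict(inherited)
        qualities.update({k: v for k, v in elem.attrib.items()
                          if k in QUALITY_ATTRS})
        children = [c for c in elem if local_name(c) == "variant"]
        if not children:
            return [qualities]        # a leaf specifies a single variant
        found = []
        for child in children:
            found.extend(walk(child, qualities))
        return found
    return walk(node_elem, {})


faq = ET.fromstring(
    '<item name="FAQ" url="faq.html">'
    '<variant description="Frequently asked questions" lang="en" />'
    '<variant description="Oftaj demandoj" lang="eo" />'
    '</item>')
for variant in variants(faq):
    print(variant)
```

Each printed dictionary carries the shared name and url from the <item> together with the description and lang of one <variant>.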

All variants of an item node must have the ‘name’ quality, or the node and its children need not appear in the node hierarchy. All variants of a childless item node must have the ‘location’ quality, or the node need not appear in the node hierarchy.

All variants of an item node with search in its role set must have the ‘search template’ quality, or the node cannot fulfil the search role.

A group node cannot have a ‘location’ quality.

Roles

An item node may have several roles, as determined by its representing element’s role attribute, and partly by that element’s position in the root sitemap file.

The following roles are defined:

home
The node refers to a home page or start page from which the user may restart navigation.
contentinfo
The node refers to a page describing how the site was made or published, and who is responsible for its upkeep and accuracy.
contact
The node refers to a page giving contact information for users of the site.
search
The node refers to a family of pages permitting a site-limited search.
searchpage
The node refers to a conventional search page.
tree
The node takes part in the node hierarchy.

Implementations are free to define other roles, but should do so by agreement with future versions of this specification, or by placing those roles in a private namespace (a mechanism for which is yet to be defined).

In its sidebar, the Firefox extension uses the search role to configure the search field, but ignores searchpage. Meanwhile, its pop-up menu for the customizable toolbars ignores search, but presents searchpage under its Search item.

An <item>’s role attribute specifies a space-separated list of role names. This is an initial set of roles that any node represented by this element can fulfil. The default list is tree, so setting the attribute to another single value implies that the represented node should not appear in the tree, i.e. it is excluded.

The first <item> in the <sitemap> element of the root file additionally takes on the role home, if no other <item> is reached with that explicit role. Note that this does not exclude that element from the node hierarchy, as exclusion is defined in terms of the actual value of the role attribute, not the set of roles that a node ultimately fulfils.
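The rule for the implicit home role might be sketched as below. The sketch makes two assumptions that the text does not settle: "the first <item> in the <sitemap> element" is read as the first <item> child of the root element, and when several reached items carry an explicit home role the first one is chosen. The function name find_home and the reached_items parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(elem):
    """Strip any '{namespace}' part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def find_home(sitemap, reached_items):
    """Return the element fulfilling the home role, or None if there is none."""
    # An explicit role="... home ..." on any reached item takes precedence.
    for item in reached_items:
        if "home" in item.get("role", "tree").split():
            return item
    # Otherwise fall back to the first <item> child of the root <sitemap>.
    for child in sitemap:
        if local_name(child) == "item":
            return child
    return None


root = ET.fromstring(
    '<sitemap>'
    '<group name="Misc" />'
    '<item name="Start" url="index.html" />'
    '<item name="About" url="about.html" />'
    '</sitemap>')
items = [e for e in root.iter() if local_name(e) == "item"]
print(find_home(root, items).get("name"))
```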

Site-limited search

The search role makes use of a node’s ‘search method’ and ‘search template’ qualities. An implementation may use it to provide the user with a ‘site-limited search’. After accepting a search term, the user agent may visit a URI formed from the node’s ‘location’, using the template resolved against the search term and the root file’s URI as defined under the data attribute.

If the ‘search method’ is get, a query ? and the resolved template are resolved against the location, and the user agent visits that address with an HTTP GET request. Otherwise, the ‘search method’ is post, and the resolved template is POSTed to the node’s location as application/x-www-form-urlencoded. The ‘search method’ is specified by the method attribute.

For example:

<item   role="search"
        name="Search"
 description="Search"
         url="/cgi-bin/search"
    xml:base="http://www.example.foo/juice/"
        data="q=%s" />

This item is only activated by filling in the search field, and does not appear in the source tree (no tree role). A search query of foo invokes a GET http://www.example.foo/cgi-bin/search?q=foo.

<item   role="search searchpage"
      method="post"
        name="Search"
 description="Search"
         url="/cgi-bin/search"
    xml:base="http://www.example.foo/juice/"
        data="q=%s" />

In the variation above, a POST http://www.example.foo/cgi-bin/search is issued, with q=foo as the content. Also, the address http://www.example.foo/cgi-bin/search may be listed as the search page.

<item   role="search tree"
        name="Search"
 description="Search"
         url="search"
    xml:base="http://www.example.foo/juice/"
        data="q=%s" />

Finally, in this variation, GET http://www.example.foo/juice/search?q=foo is issued when the term foo is sought, and the item search also appears in the tree.
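The three variations can be reproduced with a short Python sketch built on urllib.parse.urljoin. The function name search_request and its return shape are hypothetical, the template is assumed to have been resolved already (q=%s has become q=foo), and get is assumed to be the method when none is specified, as the first example suggests.

```python
from urllib.parse import urljoin


def search_request(base, url, resolved_template, method="get"):
    """Return (HTTP method, address, body) for a site-limited search."""
    location = urljoin(base, url)          # the variant's 'location' quality
    if method == "get":
        # A query "?" plus the resolved template, resolved against the location.
        return ("GET", urljoin(location, "?" + resolved_template), None)
    # Otherwise the resolved template is sent as
    # application/x-www-form-urlencoded content.
    return ("POST", location, resolved_template.encode("utf-8"))


base = "http://www.example.foo/juice/"
print(search_request(base, "/cgi-bin/search", "q=foo"))
print(search_request(base, "/cgi-bin/search", "q=foo", method="post"))
print(search_request(base, "search", "q=foo"))
```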

Data types

Language code

A language code is a case-insensitive string identifying a natural language, possibly a specific regional dialect. For example:

  • en identifies English.
  • de identifies German.
  • en-gb identifies English as spoken in Great Britain.
  • de-at identifies German as spoken in Austria.

SSP language codes follow the same format as HTML language codes. The first component is an ISO 639:1988 two-letter code. The second, if present, is an ISO 3166:1993 country code.

Character encoding

A character encoding (or ‘charset’, informally) specifies translation between octets and characters. For example:

  • US-ASCII
  • UTF-8
  • ISO-8859-1

These names are registered under IANA character sets.

Content type

A content type specifies the nature of a resource. For example:

  • image/png identifies the PNG format for images.
  • text/html identifies HTML.
  • application/pdf identifies PDF.

These names are registered under IANA Media Types.

URI reference

This is a string as defined by RFC 3986: Uniform Resource Identifier (URI): Generic Syntax.

Element ID

This is an element identifier as defined in xml:id Version 1.0. It is a case-sensitive string consisting of letters, digits, underscores, dashes, and dots, and must begin with a letter or an underscore.

Elements

Element <sitemap>

The <sitemap> element is the root of an SSP sitemap document.

Element <item>

The <item> element represents an ordinary node. Many of its attributes set the default qualities for the node’s variants, if it has any <variant> children, or set the qualities of the node’s sole variant.

Element <group>

The <group> element expresses a node of lesser prominence in the hierarchy. It need not have a name, and has no location. Its children may be rendered as if they were children of the <group>’s parent, and cannot be folded away separately from that parent’s other children. Many of its attributes set the default qualities for its node’s variants, if it has any <variant> children, or set the qualities of the node’s sole variant.

Element <external>

  • Namespace: http://standard-sitemap.org/2007/ns
  • Attributes:
    • url (required)
  • Content: empty
  • Child of:

The <external> element identifies an <item>/<group> element in the same or another document. The value of the url attribute is resolved against the element’s base, and identifies the <item>/<group> by its xml:id attribute.

The node represented by the referenced <item>/<group>, including its variants and their qualities, its child nodes, and its role, relation and priority, becomes a child of the node represented by the element containing the referencing <external> element.

Element <variant>

The <variant> element allows variants of a node to be specified. Each <variant>’s nearest <item> or <group> ancestor represents that node as a whole, and is the <variant>’s node element. Each <variant> may therefore appear in <item>, <group>, or other <variant>s.

Each <variant> that has no child elements specifies a variant of its node. Attributes of such a <variant> specify the variant’s qualities. For attributes that are not set on the <variant> itself, the qualities are derived from the corresponding attributes of the nearest ancestors that set them (i.e. they are inherited), with the following restrictions. These attributes may be inherited from any ancestor:

Other attributes may only be inherited from the node element or its children.

Element <class-change>

  • Namespace: http://standard-sitemap.org/2007/ns
  • Attributes:
    • elem
    • attr
    • prefix
  • Content: empty
  • Child of:

The <class-change> element specifies how a document served by the sitemap should be modified to indicate that it is being so served. The XPath expression specified by elem identifies an element in the served document to be modified. attr identifies an attribute on that element to be modified. The attribute prefix identifies the prefix of a family of class names to be updated and maintained in the attribute value, according to a display level in the range [0,100].

Whenever the display level is set to N, the attribute is modified so that its set of classes of the form prefix-over-integer and prefix-under-integer consists of exactly 100 items:

  • prefix-over-0 up to prefix-over-L, where L=N−1
  • prefix-under-M up to prefix-under-100, where M=N+1

Authors are expected to use these changes to dynamically alter the styling of their site in the distinct cases of being visited by a sitemap-aware user agent and a sitemap-unaware user agent.
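The class bookkeeping for a display level N can be sketched in Python as follows. The sketch reads the second range as running from prefix-under-M to prefix-under-100, which is what makes the total exactly 100 names. The function names level_classes and update_attribute are hypothetical, and the order in which the names are written back into the attribute is an arbitrary choice.

```python
import re


def level_classes(prefix, level):
    """Return the 100 managed class names for a display level in [0, 100]."""
    if not 0 <= level <= 100:
        raise ValueError("display level must be in the range [0, 100]")
    over = [f"{prefix}-over-{i}" for i in range(0, level)]          # 0 .. N-1
    under = [f"{prefix}-under-{i}" for i in range(level + 1, 101)]  # N+1 .. 100
    return over + under


def update_attribute(value, prefix, level):
    """Replace the managed names in an attribute value, keeping other classes."""
    managed = re.compile(rf"^{re.escape(prefix)}-(?:over|under)-\d+$")
    kept = [name for name in value.split() if not managed.match(name)]
    return " ".join(kept + level_classes(prefix, level))


print(update_attribute("nav ssp-over-3", "ssp", 1).split()[:3])
```

A stylesheet could then key rules on, say, a class such as ssp-over-0 to apply styling only when the display level is at least 1; the prefix ssp here is purely illustrative.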

Attributes

Attribute attr

  • Value: attribute name
  • Default: class
  • Appears on:
    • <class-change>

This specifies the name of an attribute of the element specified by elem, whose value should be managed as the served page’s display level is changed.

The attribute name may include a namespace prefix, for any prefix in effect on the <class-change> element.

Attribute charset

This attribute specifies the ‘character encoding’ quality of the node variant it applies to.

Attribute data

  • Value: URI query string template
  • Variant quality: search template
  • Appears on:
    • <item> (required on every variant when role includes search)

This attribute specifies the ‘search template’ quality of the node variant it applies to. This string specifies the template for the query data used in a site-limited search. Various % expressions are replaced by strings as listed in the following table, which shows the result of applying them to an example sitemap address of:

http://www.example.com/a/b/c/standard-sitemap.xml

…and an example search query fish.

Expression  Meaning                                             Example
%s          Search term                                         fish
%w          Website home (./ resolved against sitemap address)  http://www.example.com/a/b/c/
%(1w)       Parent (../ resolved against sitemap address)       http://www.example.com/a/b/
%(2w)       Parent (../../ resolved against sitemap address)    http://www.example.com/a/
%r          Root (/ resolved against sitemap address)           http://www.example.com/
%h          Host of sitemap address                             www.example.com

All expanded values are escaped as if they are URI query values.
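The expansion of % expressions can be sketched in Python using urllib.parse. This is one possible reading rather than a reference implementation: the names raw_value and expand_template are hypothetical, "escaped as if they are URI query values" is assumed to mean percent-encoding of every reserved character, the host is taken without any port, and unrecognised % sequences are left untouched.

```python
import re
from urllib.parse import quote, urljoin, urlsplit

# Matches %s, %w, %r, %h and the parenthesised form %(Nw).
_EXPRESSION = re.compile(r"%(?:\((\d+)w\)|([swrh]))")


def raw_value(kind, levels, term, sitemap_uri):
    """Return the unescaped replacement for one % expression."""
    if kind == "s":
        return term
    if kind == "r":
        return urljoin(sitemap_uri, "/")
    if kind == "h":
        return urlsplit(sitemap_uri).hostname or ""
    # kind == "w": climb `levels` directories above the sitemap's directory.
    return urljoin(sitemap_uri, "../" * levels or "./")


def expand_template(template, term, sitemap_uri):
    """Expand every recognised % expression, percent-encoding each value."""
    def replace(match):
        if match.group(1) is not None:
            kind, levels = "w", int(match.group(1))
        else:
            kind, levels = match.group(2), 0
        return quote(raw_value(kind, levels, term, sitemap_uri), safe="")
    return _EXPRESSION.sub(replace, template)


uri = "http://www.example.com/a/b/c/standard-sitemap.xml"
print(expand_template("q=%s&site=%h", "fish", uri))
```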

Attribute description

This attribute specifies the ‘description’ quality of the node variant it applies to. This should be a one- or two-line description, and will likely appear as a tooltip of a menu item or navigation tree.

Attribute elem

  • Value: XPath expression
  • Appears on:
    • <class-change>

This identifies an element whose attribute should be managed as the served page’s display level is changed. Only the first element that matches the expression is modified.

The XPath expression may be written in terms of any namespace prefixes in effect on the <class-change> element.

Attribute lang

This attribute specifies the ‘language’ quality of the node variant it applies to.

Attribute method

This attribute specifies the ‘search method’ quality of the node variant it applies to. It specifies the HTTP request method, GET or POST, to be used when performing a site-limited search.

Attribute name

This attribute specifies the ‘name’ quality of the node variant it applies to. This should be a relatively short name, as it will likely appear as the text of a menu item or navigation tree.

Attribute order

  • Value: none, lexical, base10, base16, version
  • Default: none
  • Appears on:

This attribute allows an implementation to make certain assumptions about the ordering-by-name of child nodes of the node represented by this element.

none
The implementation can make no assumptions about the ordering of child nodes.
lexical
The child nodes are lexically ordered by name.
base10
The child nodes are numerically ordered by name, in radix 10.
base16
The child nodes are numerically ordered by name, in radix 16.
version
The child nodes are ordered by name, and each name is a hierarchical version number (e.g. 1.3.1).

If an ordering exists, and there are many nodes for the implementation to render, it may automatically group and subgroup them, and use their names to work out the names of the synthetic groups. This allows the implementation to choose the size of such groups according to local requirements, e.g. the number that fit comfortably into a screen.

For example, a large number of programming symbols could be divided into several synthetic groups whose names are formed by taking the name of the first node in the group, appending an ellipsis, and then appending the name of the last node in the group. This node:

<group name="Defined symbols" order="lexical">
  <item name="abort" .../>
  <item name="abs" .../>
  <item name="acos" .../>
  <!-- 900 or so other items, in alphabetic order -->
  <item name="wscanf" .../>
  <item name="xor" .../>
  <item name="xor_eq" .../>
</group>

…could be divided like this:

  • abort…atoi
  • atol…catanf
  • catanh…compl
  • …30 groups of about 30 items each…
  • uint_fast8_t…vprintf
  • vscanf…wcsncat
  • wcsncmp…xor_eq
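The naming scheme for synthetic groups might be sketched like this in Python. The function name synthetic_groups and the fixed group size are assumptions for illustration; an implementation is free to pick sizes according to its own display constraints.

```python
def synthetic_groups(names, size):
    """Split already-ordered child names into labelled groups of up to `size`."""
    groups = []
    for start in range(0, len(names), size):
        chunk = names[start:start + size]
        # Label: first name, an ellipsis, then the last name in the group.
        label = chunk[0] if len(chunk) == 1 else f"{chunk[0]}…{chunk[-1]}"
        groups.append((label, chunk))
    return groups


symbols = ["abort", "abs", "acos", "wscanf", "xor", "xor_eq"]
for label, members in synthetic_groups(symbols, 3):
    print(label, members)
```

Because the order attribute promises that the children are already ordered by name, the sketch never sorts them itself.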

Attribute prefix

  • Value: class-name prefix
  • Appears on:
    • <class-change>

This identifies a family of class names of the form prefix-over-integer and prefix-under-integer which should be managed on an attribute of an element in the served page as its display level is changed.

Attribute priority

  • Value: decimal [0.0, 1.0]
  • Default: decimal 0.5
  • Appears on:

This attribute allows the author to set the relative priorities or levels of importance of nodes across the site. Implementations may represent these differences by, for example, emboldening names of more important nodes, or changing font size as appropriate.

Attribute refer

  • Value: space-separated list of user, root, map, page, parent, none
  • Default: user
  • Appears on:

This attribute specifies the ‘referrer’ quality of the node variant it applies to. The following values are permitted:

root
The referrer shall be the URI of the root file.
map
The referrer shall be the URI of the sitemap file containing the <item> element that represents this node.
page
The referrer shall be the URI of the current page.
parent
The referrer shall be the URI of the page identified by this node’s parent. Implementations are not required to support this, as it might make the UA appear to be a referrer spammer; such implementations should interpret this as user instead. Furthermore, this value only has meaning if a node is accessed as part of the tree of nodes presented to the user. If the node is instead accessed through a role, the implementation should treat it as user.
none
There shall be no referrer URI.
user
The UA shall determine the referrer URI.

If an implementation does not recognize this attribute, it should behave as if user was specified. When multiple values are specified, the implementation should behave according to the first it can honour.
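Selecting a referrer from a list of values could look like the following Python sketch. The function name choose_referrer, the uris mapping and the honoured tuple are hypothetical. The sketch assumes that a value the implementation cannot honour is skipped in favour of the next one, and that user is the fallback when nothing in the list can be honoured.

```python
def choose_referrer(refer, uris, honoured=("root", "map", "page", "none", "user")):
    """Pick the first listed referrer policy this implementation can honour.

    `uris` maps "root", "map", "page" and "parent" to URIs known to the caller.
    Returns (policy, referrer): referrer is a URI, "" for no referrer, or
    None when the choice is left to the user agent.
    """
    for policy in (refer or "user").split():
        if policy not in honoured:
            continue                      # e.g. parent, if unsupported
        if policy == "user":
            return ("user", None)
        if policy == "none":
            return ("none", "")
        if policy in uris:
            return (policy, uris[policy])
    return ("user", None)


known = {
    "root": "http://www.example.com/sitemap.xml",
    "map": "http://www.example.com/docs/map.xml",
    "page": "http://www.example.com/docs/intro.html",
}
print(choose_referrer("parent root", known))
```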

Attribute relation

  • Value: none, sequence
  • Default: none
  • Appears on:

This attribute specifies whether the children of the node it specifies form some sort of sequence, and should be navigable as such. For example, an implementation may provide ‘previous’ and ‘next’ buttons to traverse them rapidly.

none
The child nodes do not form a navigable sequence.
sequence
The child nodes form a navigable sequence.

Attribute role

  • Value: space-separated token list
  • Default: tree
  • Appears on:

This specifies a set of potential roles for the node represented by the containing element.

For example:

<item name="Contact"
       url="contact.html"
      role="contact contentinfo" />

Attribute tree

  • Value: include, exclude, auto, user
  • Default: auto
  • Appears on:

This attribute specifies an element’s participation in the sitemap’s node hierarchy.

include
If the <item> is reached from the root file, it will represent a node in the hierarchy.
exclude
The <item> will not represent a node in the hierarchy.
auto
If the <item> is reached from the root file, and has a role set including tree or an unknown role, it will represent a node in the hierarchy.
user
If the <item> is reached from the root file, and has a role set including tree or an unknown role, or the user prefers it, it will represent a node in the hierarchy.

exclude allows an author to prevent a node from appearing in the hierarchy, while allowing it to fulfil a role (e.g. search). For example, suppose you have separate pages for showing a blank search form (search.html) and displaying results (search-results.cgi). The results page should never be accessed without a query string, so it must never appear in the node hierarchy:

<!-- The search form is just a normal page. -->
<item name="Search this site"
       url="search.html" />

<!-- The search results without a query are
     always excluded from the hierarchy. -->
<item name="Search this site"
       url="search-results.cgi"
      role="search"
      data="q=%s"
      tree="exclude" />

Using the default setting auto would not guarantee that the <item> would be excluded.

include allows an author to override the automatic removal of nodes from the hierarchy by virtue of them also fulfilling roles. user allows the author to defer that overriding according to the user’s preference.

Attribute type

This attribute specifies the ‘content type’ quality of the node variant it applies to.

<!-- The same document is offered in two formats. -->
<item name="Specification">
  <variant type="text/html" url="spec.html" />
  <variant type="application/pdf" url="spec.pdf" />
</item>

Attribute url

This attribute specifies the ‘location’ quality of the node variant it applies to. As a URI reference, it is resolved against the base URI of its element.

Attribute xml:base

This attribute overrides the base URI of an element. The base URI of the root element of a document is the document URI, and the base URI of any other element is the base URI of its parent. However, if an element specifies xml:base, its value is first resolved against the element’s base as it would be if xml:base were not specified, and that resolved value becomes the element’s base URI (and, therefore, the base URI of descendant elements that don’t specify xml:base).

This is in accordance with XML Base.

Authors should be aware that xml:base influences the url attribute of <external> elements.
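The way base URIs accumulate can be sketched in Python with urllib.parse.urljoin. The helper names effective_base and resolve_url are hypothetical, and the caller is assumed to supply the xml:base values of the element's ancestors (and of the element itself) from the outermost inwards.

```python
from urllib.parse import urljoin


def effective_base(document_uri, xml_base_values):
    """Fold xml:base values from the root element down to an element.

    `xml_base_values` lists, outermost first, the xml:base attribute of each
    ancestor-or-self element; use None where the attribute is absent.
    """
    base = document_uri                   # base URI of the root element
    for value in xml_base_values:
        if value is not None:
            base = urljoin(base, value)   # resolved against the inherited base
    return base


def resolve_url(document_uri, xml_base_values, url):
    """Resolve a url attribute against its element's effective base URI."""
    return urljoin(effective_base(document_uri, xml_base_values), url)


doc = "http://www.example.foo/site/map.xml"
print(effective_base(doc, [None, "sub/", None, "deeper/"]))
print(resolve_url(doc, ["http://www.example.foo/juice/"], "search"))
```

The same resolution applies to the url attribute of an <external> element, which is why an xml:base on an ancestor changes which document the reference points to.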

Attribute xml:id

This attribute gives its element an identity unique within the document, in accordance with xml:id Version 1.0. In a sitemap file, this primarily exists to support the <external> element. If an element is given an identifier as follows:

<item xml:id="fred" ... />

…then it can be referenced from within the same document like this:

<external url="#fred" />

It can also be referenced from another document:

<external url="freds-document.xml#fred" />
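Looking up the target of an <external> reference can be sketched in Python with xml.etree.ElementTree. This is a sketch under assumptions: the function name resolve_external is hypothetical, and documents is assumed to be a mapping from absolute URIs to already-parsed root elements, so that no fetching is shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag, urljoin

# ElementTree reports the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_external(url, base_uri, documents):
    """Return the <item> or <group> that an <external> url points to, or None."""
    target, fragment = urldefrag(urljoin(base_uri, url))
    root = documents.get(target)
    if root is None or not fragment:
        return None
    for elem in root.iter():
        local = elem.tag.rsplit("}", 1)[-1]     # ignore any namespace part
        if elem.get(XML_ID) == fragment and local in ("item", "group"):
            return elem
    return None


documents = {
    "http://www.example.com/map.xml": ET.fromstring(
        '<sitemap><item xml:id="fred" name="Fred" url="fred.html" /></sitemap>'),
    "http://www.example.com/freds-document.xml": ET.fromstring(
        '<sitemap><group xml:id="fred" name="Fred group" /></sitemap>'),
}
here = "http://www.example.com/map.xml"
print(resolve_external("#fred", here, documents).get("name"))
print(resolve_external("freds-document.xml#fred", here, documents).get("name"))
```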
