The Sitemap protocol for website SEO

The Sitemap protocol is used to tell search engines which pages of a site are available for crawling. In its simplest form, a sitemap is an XML file that lists page URLs together with optional attributes (such as last modification time and page importance). A sitemap only gives search engines better hints for crawling; it does not guarantee that a search engine will crawl according to the data in the file. The protocol also allows other formats such as RSS and plain text. This article uses only the XML format.

The Sitemap protocol requires that XML files be entity-escaped and encoded as UTF-8. In addition, they must meet the following conditions:

  • The file must begin with <urlset> and end with </urlset> (apart from the XML document declaration), and <urlset> must declare the protocol namespace (for example http://www.sitemaps.org/schemas/sitemap/0.9);
  • Each URL is represented by a <url> tag;
  • Each <url> tag must contain a <loc> child tag;
  • A sitemap file can contain at most 50,000 URLs and must be no larger than 50MB uncompressed (for faster transmission, the file may be compressed with gzip).
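
The limits in the last condition can be checked before a file is published. The following is a minimal Python sketch, not part of the protocol itself; the names within_limits and compress_sitemap are hypothetical, and it assumes that 50MB means 50 * 1024 * 1024 bytes measured on the uncompressed file.

```python
import gzip

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # assumption: 50MB counted on the uncompressed file


def within_limits(url_count, xml_bytes):
    """Return True if the URL count and uncompressed size respect the limits."""
    return url_count <= MAX_URLS and len(xml_bytes) <= MAX_BYTES


def compress_sitemap(xml_bytes):
    """Return the gzip-compressed form of an already serialized sitemap."""
    return gzip.compress(xml_bytes)


data = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>'
if within_limits(0, data):
    compressed = compress_sitemap(data)
```

The compressed bytes would typically be written to a file such as sitemap.xml.gz; the size check is done on the original bytes, not on the compressed result.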

The following is an example of a simple sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2020-12-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- ... -->
</urlset>
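
A file like this does not have to be written by hand. As a rough sketch, assuming Python's standard xml.etree.ElementTree module and a hypothetical helper named build_sitemap, it could be generated as follows; text values are entity-escaped by the library when serialized.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize a list of dicts into sitemap XML bytes.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional and written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # xml_declaration requires Python 3.8 or later
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2020-12-15",
     "changefreq": "monthly", "priority": 0.8},
])
print(sitemap_xml.decode("utf-8"))
```

The output is a single line rather than the indented layout shown above, which makes no difference to an XML parser.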

Sitemap protocol tag descriptions

Tag | Required | Description
urlset | Yes | The root element of the sitemap.
url | Yes | The parent element of each URL entry. The remaining tags are child elements of <url>.
loc | Yes | The URL of the page. It must begin with the protocol (for example, http) and end with a trailing slash if the web server requires one, and its length must be less than 2,048 characters.
lastmod | No | The time the page was last modified, in W3C Datetime format. The time portion may be omitted, leaving YYYY-MM-DD.
changefreq | No | How often the page content is likely to change. It hints to search engine spiders how often they should re-crawl the content (the actual interval is decided by the search engine).
priority | No | The priority of the page relative to other pages of the site, between 0.0 and 1.0. If it is not set, the default is 0.5. This value only ranks pages within the same site and does not affect the position in search engine results.

For changefreq, there are the following allowed values:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

Among them, always means that the content changes every time the page is accessed, and never is meant for archived content.
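
The constraints on loc, changefreq and priority described above can be checked before an entry is written to the file. The following is only a sketch: validate_entry is a hypothetical helper, and restricting loc to http and https is an assumption made for illustration.

```python
ALLOWED_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def validate_entry(loc, changefreq=None, priority=None):
    """Return a list of problems found in one <url> entry (empty if none)."""
    problems = []
    # Assumption: only http and https URLs are expected in this sitemap.
    if not loc.startswith(("http://", "https://")):
        problems.append("loc must begin with the protocol, e.g. http")
    if len(loc) >= 2048:
        problems.append("loc must be shorter than 2,048 characters")
    if changefreq is not None and changefreq not in ALLOWED_CHANGEFREQ:
        problems.append("unknown changefreq: " + changefreq)
    if priority is not None and not 0.0 <= priority <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems


print(validate_entry("http://www.example.com/", "monthly", 0.8))
```

A valid entry yields an empty list, so the result can be used directly in an if statement.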

* Google's documentation states that Google ignores the <priority> and <changefreq> values, so if the sitemap is submitted only to Google, these two tags can be omitted.

Google ignores <priority> and <changefreq> values, so don’t bother adding them.

Entity escaping

The following characters must be escaped when they appear in data values (that is, outside the tags) in the file:

Character | Escaped form
& | &amp;
' | &apos;
" | &quot;
> | &gt;
< | &lt;
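
When the file is assembled as plain strings, these replacements have to be applied explicitly. A small Python sketch using xml.sax.saxutils.escape from the standard library is shown below; escape_value is a hypothetical wrapper name. By default escape handles &, < and >, so the two quote characters are passed as extra entities.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; the quotes are added explicitly.
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_value(text):
    """Entity-escape a value before placing it between sitemap tags."""
    return escape(text, EXTRA_ENTITIES)


print(escape_value("http://www.example.com/view?id=1&name='a'"))
# prints: http://www.example.com/view?id=1&amp;name=&apos;a&apos;
```

Libraries that build an XML tree and serialize it usually perform this step themselves, in which case escaping by hand would escape the value twice.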

All URLs must conform to the RFC-3986 URI standard, the RFC-3987 IRI standard, and the XML standard. Non-ASCII characters (such as Chinese characters) that appear in a URL must also be percent-encoded, for example:

https://example.com/示例.html/

This URL must be encoded into the following form:

https://example.com/%E7%A4%BA%E4%BE%8B.html/
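
The same conversion can be reproduced with Python's urllib.parse module. This is a sketch under some assumptions: encode_url is a hypothetical helper, only the path and query are encoded, existing percent signs are left alone so that already encoded URLs are not encoded twice, and non-ASCII host names (which need a different encoding) are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # '%' is kept as safe so an already encoded URL passes through unchanged.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )


print(encode_url("https://example.com/示例.html/"))
# prints: https://example.com/%E7%A4%BA%E4%BE%8B.html/
```

quote encodes the characters as UTF-8 by default, which matches the encoding required for the sitemap file.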

Sitemap index file

A single sitemap file supports at most 50,000 URLs and must be no larger than 50MB. If the site contains more than 50,000 URLs, or the file would exceed 50MB, you need to create multiple sitemap files and list them in a sitemap index file.

Like a sitemap, a sitemap index file is a UTF-8 encoded, entity-escaped XML file. It must meet the following conditions:

  • The file must begin with <sitemapindex> and end with </sitemapindex>;
  • Each sitemap file is represented by a <sitemap> tag, and each <sitemap> tag must contain a <loc> element that gives the location of the sitemap file;
  • A sitemap index file can list at most 50,000 sitemap files and must be no larger than 50MB.

Tag | Required | Description
<sitemapindex> | Yes | The root element of the sitemap index file.
<sitemap> | Yes | Encapsulates the information about one sitemap file; it is the parent element of the tags below.
<loc> | Yes | The location of the sitemap file, which can be an XML sitemap, another supported format such as RSS or plain text, or a gzip-compressed file.
<lastmod> | No | The modification time of the sitemap file, in W3C Datetime format.

Each sitemap file listed in the index must itself contain no more than 50,000 URLs and be no larger than 50MB.

The following is an example of a simple sitemap index file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml</loc>
      <lastmod>2020-10-01T00:00:00+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2020-10-01T00:00:00+00:00</lastmod>
   </sitemap>
   <!-- ... -->
</sitemapindex>
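
For a large site, the split into several sitemap files and the matching index can be produced together. The sketch below is one possible approach, not a prescribed one: chunk_urls and build_index are hypothetical helpers, the file names sitemap1.xml, sitemap2.xml and so on are assumed, and only the URL count limit is enforced (the 50MB limit would need a separate check).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locs, lastmod=None):
    """Serialize a sitemap index that points at the given sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# 120,000 URLs are split into three sitemap files.
urls = ["http://www.example.com/page%d" % i for i in range(120_000)]
chunks = chunk_urls(urls)
locs = ["http://www.example.com/sitemap%d.xml" % n
        for n in range(1, len(chunks) + 1)]
index_xml = build_index(locs, lastmod="2020-10-01T00:00:00+00:00")
```

Each entry of chunks would then be written out as its own sitemap file under the corresponding name in locs.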

Multi-site support

The Sitemap protocol also allows URLs from multiple subdomains or multiple domain names to be listed. With multiple domain names, you can either put all URLs in a single sitemap file or keep a separate sitemap for each domain.

For example, you can declare in a single sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://host1.example.com/</loc>
  </url>
  <url>
    <loc>http://host2.example.com/</loc>
  </url>
  <url>
    <loc>http://host1.example1.com/</loc>
  </url>
</urlset>

Note, however, that search engines usually accept this only for domain names that have been verified under the same account. For Google, for example, all of the domain names must be verified in Google Search Console.
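
If separate sitemaps per domain are preferred over a single file, the URLs first have to be grouped by host. A minimal sketch, assuming a hypothetical helper named group_by_host and the example hosts used above:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by host name so each host can get its own sitemap."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


groups = group_by_host([
    "http://host1.example.com/",
    "http://host2.example.com/",
    "http://host1.example1.com/",
    "http://host1.example.com/about",
])
for host, host_urls in groups.items():
    print(host, len(host_urls))
```

Each group can then be written to its own sitemap file and submitted for the matching verified domain.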

Origin blog.csdn.net/ghosind/article/details/111713141