Web Cache (1) - HTTP Protocol Cache

Why use web caching

Web caches are generally divided into browser caches, proxy server caches, and gateway caches. This article focuses on the browser cache; the other two are left for the reader to explore.

A web cache sits between servers and clients. The server may be the origin server (the server where the resource resides), and there may be one or more of them; likewise there may be one or more clients. The cache watches the requests passing between them and saves a copy of each response (such as HTML pages, images, and files). If a later request is for the same URL, the cache serves the saved copy directly instead of bothering the origin server again.
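The idea of saving a copy per URL can be sketched in a few lines. This is only an illustration: the names copies, fetchFromOrigin, and cachedGet are hypothetical, and the "origin server" here is a stand-in function, not a real network call.

```javascript
// Hypothetical in-memory cache: maps a URL to the saved copy of its response.
const copies = new Map();
let originHits = 0;

// Stand-in for the origin server (an assumption for this sketch).
function fetchFromOrigin(url) {
  originHits += 1;
  return `content of ${url}`;
}

// Serve the saved copy when one exists; otherwise ask the origin and save the result.
function cachedGet(url) {
  if (copies.has(url)) {
    return { body: copies.get(url), fromCache: true };
  }
  const body = fetchFromOrigin(url);
  copies.set(url, body);
  return { body, fromCache: false };
}

const first = cachedGet("/index.html");  // goes to the origin
const second = cachedGet("/index.html"); // served from the saved copy
console.log(first.fromCache, second.fromCache, originHits);
```

A real cache also has to decide how long a copy stays usable, which is what the HTTP headers discussed later in this article control.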

There are two main reasons to use a cache:

  • Lower latency: The cache is closer to the client, so fetching content from the cache takes less time than fetching it from the origin server. Pages render faster and the website feels more responsive.
  • Less network traffic: Copies are reused, which greatly reduces the bandwidth users consume. This also saves money when traffic is billed, and keeping bandwidth demand low makes the service easier to maintain.

Consider a large website today: any page may issue one or two hundred requests, and daily page views (PV) can reach the 100 million level. Without caching, the user experience would drop sharply because of the time spent waiting for requests, and both server load and network bandwidth would be put to a severe test.

Browser Cache Control Mechanism

There are three browser cache control mechanisms: HTML5 offline storage and local cache, HTML Meta tag, and HTTP protocol cache.

HTML5 offline storage and local caching

This mechanism caches data using the new APIs that HTML5 introduced to support offline applications, such as appcache, sessionStorage, and localStorage.

appcache lists the resources to download and cache by defining a manifest file. An example of a manifest file is as follows:

CACHE MANIFEST
# Comment

file.js
file.css

Then reference in html:

<html manifest="./xxx.manifest">

The basic usage of sessionStorage and localStorage is as follows:

// localStorage is used the same way
sessionStorage.setItem('name', 'laixiangran') // store data
sessionStorage.getItem('name') // retrieve data: 'laixiangran'

This article does not go into detail on these APIs; they will be covered separately later.

HTML Meta

With the HTML Meta tag, web developers can add a <meta> tag to the <head> node of an HTML page. The code is as follows:

<META HTTP-EQUIV="Pragma" CONTENT="no-cache">

This code tells the browser not to cache the current page, so every visit has to fetch it from the server.

It is simple to use, but only some browsers support it, and caching proxy servers generally do not, because a proxy does not parse the HTML content itself.

HTTP protocol cache

HTTP protocol caching is the focus of this article. It controls caching through HTTP headers, which give you more control over how browsers and proxy servers handle your copies. They are invisible in the HTML code and are generally automatically generated by the web server. However, depending on the server you use, you can control it to some extent.

Browser request process

Browser first request flow chart:

The process is relatively simple. On the first request the browser has no cached copy, so it requests the resource directly from the server. When the response arrives, the browser caches the data in memory or on the hard disk according to the HTTP header information.

When the browser requests again:

This process is much more complicated. Based on the HTTP header information, the browser has to decide whether to read the data directly from the cache or to ask the server whether the cached data can still be used.
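The decision on a repeat request can be sketched roughly as follows. The shape of the cache record (storedAt, maxAgeSeconds, etag, lastModified) and the function name decide are assumptions made for this illustration; real browsers apply more rules than this.

```javascript
// Decide how to handle a repeat request.
// `entry` is a hypothetical cache record:
//   { storedAt: ms timestamp, maxAgeSeconds: number, etag?: string, lastModified?: string }
function decide(entry, nowMs) {
  // Nothing cached yet: behave like a first request.
  if (!entry) return "full-request";

  // Still within its validity period: read straight from the cache, 200 (from cache).
  const ageSeconds = (nowMs - entry.storedAt) / 1000;
  if (ageSeconds < entry.maxAgeSeconds) return "use-cache";

  // Expired, but the copy carries a validator: let the server answer 200 or 304.
  if (entry.etag || entry.lastModified) return "conditional-request";

  // Expired and nothing to validate with: fetch the resource again.
  return "full-request";
}

const entry = { storedAt: 0, maxAgeSeconds: 60, etag: '"abc"' };
console.log(decide(entry, 30 * 1000)); // within 60 seconds
console.log(decide(entry, 90 * 1000)); // after 60 seconds
```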

The difference between several status codes:

Next, we explain the HTTP headers involved in HTTP protocol caching, starting from the two statuses that appear in this process: 200 (from cache) and 304.

200(from cache)

This HTTP status code indicates that the server is not accessed, and data is read directly from the cache (memory or hard disk).

Look at the two pictures:

In the two pictures above, the status codes differ slightly: 200 (from memory cache) and 200 (from disk cache). The difference is that one reads the data from memory and the other from the hard disk; the browser looks in memory first and then on the hard disk. Here we refer to both as 200 (from cache).

For 200 (from cache), we need to pay attention to two HTTP header fields: Expires and Cache-Control.

Expires

Expires gives the validity period of the cache: it tells the browser until when the cached copy may be used. Once that time has passed, the cache checks with the origin server to determine whether the file has changed.

The only valid value for the Expires header is an HTTP date; any other value is invalid and the response will not be cached. Note that the time is Greenwich Mean Time (GMT), not local time. For example:

Expires: Mon, 29 Oct 2018 03:53:10 GMT

Now look at the Expires value in the two pictures above. The copy expires at 2018-10-29 03:53:10, and the Date of our request is 2018-04-29 03:53:10, so this request reads the data directly from the cache and returns 200 (from cache).
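The comparison described here can be sketched as a small function. The name isFreshByExpires is hypothetical, and the sketch ignores clock differences between server and client; it only uses the two timestamps from the example.

```javascript
// Return true while the current time is still before the Expires time.
function isFreshByExpires(expiresHeader, nowMs) {
  const expiresMs = Date.parse(expiresHeader);
  // A value that cannot be parsed as a date is treated as already expired.
  if (Number.isNaN(expiresMs)) return false;
  return nowMs < expiresMs;
}

const expires = "Mon, 29 Oct 2018 03:53:10 GMT";
// Request time from the example: 2018-04-29 03:53:10 GMT (months are zero-based, 3 = April).
const requestTime = Date.UTC(2018, 3, 29, 3, 53, 10);

console.log(isFreshByExpires(expires, requestTime)); // the copy has not expired yet
```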

Although the Expires header is useful, it has certain limitations:

  • Because a time is involved, the clocks of the web server and the cache must be synchronized; otherwise the result may not be what you expect: the cache may treat expired data as fresh, or fresh data as expired.
  • It is easy to forget that a specific expiration time was set for some content. If the expiration time is not updated when the content is served again, every request goes to the server, which increases load and response time.
  • Finally, Expires comes from HTTP 1.0. Browsers now use HTTP 1.1 by default, so its role is largely superseded.

Cache-Control

Cache-Control serves the same purpose as Expires. Both indicate how long the current resource is valid and control whether the browser reads data directly from its cache or sends a new request to the server. Cache-Control simply offers more options and finer-grained settings. If both are set, Cache-Control takes priority over Expires.

Useful Cache-Control response directives include:

  • max-age=[seconds]: The cached copy is fresh and does not need to be updated within this period. It is similar to the Expires time, but it is relative rather than absolute: it is the number of seconds the copy stays fresh after a successful request.
  • s-maxage=[seconds]: Similar to max-age, except that it only applies to shared caches (such as proxies).
  • public: Marks responses that require authentication as cacheable. Normally, responses to requests that require HTTP authentication are automatically private (not stored by shared caches).
  • private: Allows caches that are specific to one user, such as the browser's, to store the response; shared caches, such as proxies, may not.
  • no-cache: Forces the cache to send a request to the origin server for validation every time before releasing the cached copy. This is useful to make sure authentication is respected (in combination with public) or that content is always current, without giving up all the benefits of caching, for example when refreshing a Weibo or Twitter feed.
  • no-store: Forces caches not to keep any copy under any circumstances.
  • must-revalidate: Tells caches that they must strictly obey the freshness information you provide. HTTP allows caches to serve stale data under certain special conditions; specifying this directive tells caches that they must strictly follow your rules.
  • proxy-revalidate: Similar to must-revalidate, except that it only applies to proxy caches.
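To make the directive list concrete, here is a minimal sketch of splitting a Cache-Control value into its directives. The function name parseCacheControl and the sample header are assumptions for illustration; the parser is deliberately simplified and does not handle commas inside quoted values.

```javascript
// Split a Cache-Control header value into an object of directives.
function parseCacheControl(header) {
  const directives = {};
  for (const part of header.split(",")) {
    const [rawName, rawValue] = part.trim().split("=");
    if (!rawName) continue; // skip empty segments
    const name = rawName.toLowerCase();
    // Directives without a value (such as no-cache) are recorded as true;
    // values keep their string form, with surrounding quotes removed.
    directives[name] = rawValue === undefined ? true : rawValue.replace(/^"|"$/g, "");
  }
  return directives;
}

const parsed = parseCacheControl("public, max-age=15811200, must-revalidate");
console.log(parsed);
```

A cache would then check, for example, whether parsed["no-store"] is set before saving a copy, and use Number(parsed["max-age"]) as the freshness lifetime in seconds.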

Use as follows:

Cache-Control: max-age=15811200

Now look at the Cache-Control value in the two pictures above. The copy is valid for 15811200 seconds after the request succeeds, so this request reads the data directly from the cache and returns 200 (from cache). Once those 15811200 seconds have passed, the browser requests new data from the server again.
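As a quick check of the arithmetic, 15811200 seconds is 183 days, which takes the response Date of 2018-04-29 03:53:10 GMT to exactly the Expires time shown earlier. The sketch below works this out; isFreshByMaxAge is a hypothetical helper, and the age calculation is simplified to "now minus response Date".

```javascript
const MAX_AGE_SECONDS = 15811200;
// Response Date from the example: 2018-04-29 03:53:10 GMT (months are zero-based, 3 = April).
const responseDate = Date.UTC(2018, 3, 29, 3, 53, 10);

// The copy is fresh while its age is below max-age.
function isFreshByMaxAge(responseDateMs, maxAgeSeconds, nowMs) {
  const ageSeconds = (nowMs - responseDateMs) / 1000;
  return ageSeconds < maxAgeSeconds;
}

const days = MAX_AGE_SECONDS / 86400;
const freshUntil = new Date(responseDate + MAX_AGE_SECONDS * 1000).toUTCString();

console.log(days);       // 183
console.log(freshUntil); // Mon, 29 Oct 2018 03:53:10 GMT
```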

304

When the browser determines from Expires or Cache-Control that the cache has expired, it resends the request to the server so that the server can decide whether the current cached copy can still be used.

When the server determines that the cached copy is out of date, it returns the new data with HTTP status code 200.

When the server determines that the cached copy is still valid, it returns HTTP status code 304 (no message body, which saves traffic), telling the browser to keep using the cache.

So which HTTP header fields decide whether 200 or 304 is returned? This brings us to the next protagonists: Last-Modified/If-Modified-Since and Etag/If-None-Match. Both pairs need to be used together with Cache-Control.

Last-Modified/If-Modified-Since

  • Last-Modified: Indicates the last modification time of the resource in this response. When the web server responds to a request, it tells the browser when the resource was last modified.

  • If-Modified-Since: When the resource has expired (as determined by the max-age in Cache-Control) and the browser finds that it carries a Last-Modified declaration, the browser adds If-Modified-Since, set to the Last-Modified value it received earlier, when it requests the resource from the web server again. When the web server receives the request and finds If-Modified-Since, it compares that value with the last modification time of the requested resource. If the last modification time is newer, the resource has been modified, and the server responds with HTTP 200 and the full resource content in the message body. If the last modification time is not newer, the resource has not been modified, and the server responds with HTTP 304 (no message body, which saves traffic), telling the browser to continue using the cache.
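The server-side comparison can be sketched like this. The function name respondToIfModifiedSince and the shape of the returned object are assumptions for this illustration, not the behavior of any particular web server.

```javascript
// Decide between 200 and 304 from the If-Modified-Since request header.
// `lastModifiedMs` is the resource's last modification time in milliseconds.
function respondToIfModifiedSince(ifModifiedSince, lastModifiedMs) {
  const headers = { "Last-Modified": new Date(lastModifiedMs).toUTCString() };
  const sinceMs = ifModifiedSince ? Date.parse(ifModifiedSince) : NaN;

  // HTTP dates have one-second resolution, so compare whole seconds.
  const resourceSeconds = Math.floor(lastModifiedMs / 1000);
  if (!Number.isNaN(sinceMs) && resourceSeconds <= Math.floor(sinceMs / 1000)) {
    // Not modified since the browser's copy: no body, the browser keeps using its cache.
    return { status: 304, headers, body: null };
  }
  // Modified (or no usable header): send the full content.
  return { status: 200, headers, body: "resource content" };
}

const modifiedAt = Date.UTC(2018, 3, 29, 3, 53, 10);
const browserValue = new Date(modifiedAt).toUTCString(); // what the browser saved from Last-Modified

console.log(respondToIfModifiedSince(browserValue, modifiedAt).status);        // unchanged resource
console.log(respondToIfModifiedSince(browserValue, modifiedAt + 5000).status); // resource changed 5 seconds later
```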

Etag/If-None-Match

This is a new validator introduced in HTTP 1.1.

  • Etag: When the web server responds to a request, it tells the browser the unique identifier of the current resource on the server (the generation rule is determined by the server). In Apache, the ETag value is by default obtained by hashing the file's inode number (INode), size (Size), and last modification time (MTime).

  • If-None-Match: When the resource has expired (as determined by the max-age in Cache-Control) and the browser finds that it carries an Etag declaration, the browser adds If-None-Match (set to the Etag value) when it requests the resource from the web server again. When the web server receives the request and finds If-None-Match, it compares the value with the current validator of the requested resource and decides whether to return 200 or 304.
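A rough sketch of this exchange follows. The ETag format built here (hexadecimal size and modification time) is an assumption chosen for illustration and is not claimed to be Apache's exact algorithm; makeEtag and respondToIfNoneMatch are hypothetical names.

```javascript
// Illustrative ETag built from file size and modification time (in seconds), both in hex.
function makeEtag(sizeBytes, mtimeMs) {
  return `"${sizeBytes.toString(16)}-${Math.floor(mtimeMs / 1000).toString(16)}"`;
}

// If-None-Match uses weak comparison, so a leading W/ is ignored.
function stripWeak(tag) {
  return tag.trim().replace(/^W\//, "");
}

// Decide between 200 and 304 from the If-None-Match request header.
function respondToIfNoneMatch(ifNoneMatch, currentEtag) {
  const candidates = (ifNoneMatch || "").split(",").map(stripWeak);
  const matches = candidates.includes("*") || candidates.includes(stripWeak(currentEtag));
  return matches
    ? { status: 304, headers: { ETag: currentEtag }, body: null }
    : { status: 200, headers: { ETag: currentEtag }, body: "resource content" };
}

const oldTag = makeEtag(255, 16000); // "ff-10"
const newTag = makeEtag(256, 17000); // the file changed, so the tag changes too

console.log(respondToIfNoneMatch(oldTag, oldTag).status); // browser copy still matches
console.log(respondToIfNoneMatch(oldTag, newTag).status); // browser copy is out of date
```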

Etag takes precedence over Last-Modified

You might think that using Last-Modified is enough to let the browser know if the local cached copy is fresh enough, why do you need an Etag (entity identifier)? The emergence of Etag in HTTP1.1 is mainly to solve several problems that are difficult to solve with Last-Modified:

  • The last modification marked by Last-Modified can only be accurate to the second level. If some files are modified multiple times within 1 second, it will not be able to accurately mark the modification time of the file.

  • Some files are regenerated periodically. The content may not have changed, but Last-Modified has, so the cached copy cannot be reused.

  • It is possible that the server does not accurately obtain the file modification time, or it is inconsistent with the proxy server time.

Etag is a unique identifier for the resource on the server side, generated automatically by the server or by the developer, and it allows more precise control over the cache. Last-Modified and ETag can be used together. The server checks the ETag first; under the HTTP specification, If-None-Match takes precedence over If-Modified-Since when both are present, although some server implementations go on to compare Last-Modified as well before deciding whether to return 304.

Tips for Creating Cache-Enabled Websites

The introduction above explained the mechanism of HTTP protocol caching. Its purpose is to let you control the browser cache more flexibly and precisely, so that your website is more cache friendly and the user experience is better.

The following tricks can also make your site more cache friendly:

  • Keep URLs stable: This is the golden rule of caching. If you serve the same content on different pages, to different users, or from different sites, it should use the same URL. This is simple but very effective. For example, if your HTML references "/index.html" in one place, always use exactly that address.
  • Use a common library of images and other elements, and refer to it from different places.
  • Enable caching for images and pages that change infrequently by setting max-age in the Cache-Control header to a large value.
  • For content that is updated regularly, let caches handle it by specifying an appropriate max-age or expiration time.
  • If a resource changes (especially a downloadable file), change its name. Such a resource usually has a long expiration time, and the correct version is always on the server, so the page that links to it needs a shorter expiration time. Otherwise the server has the new resource but the cached page still links to the old address, and old and new versions may conflict.
  • Don't change files unless necessary: otherwise Last-Modified gets a new value. Also, when you update your site, upload only the files that changed instead of overwriting the entire site.
  • Use cookies only where necessary: cookies are difficult to cache and are unnecessary in most situations. If you must use cookies, limit them to dynamic pages.
  • Minimize the use of SSL: shared caches cannot store encrypted pages, so use SSL only when necessary, and use images sparingly on SSL pages.

SSL: Secure Sockets Layer, developed by Netscape to secure data transmission on the Internet. It uses data encryption to ensure that data cannot be intercepted or eavesdropped on in transit. At present, the general specification is the 40-bit security standard; the United States has introduced a stronger 128-bit standard, but its export is restricted. IE and Netscape browsers from version 3.0 onward support SSL.

  • Check your website with REDbot: it can help you apply some of the concepts introduced in this article.

REDbot: REDbot = RED + robot. It is a robot that inspects HTTP resources to see how they behave, points out common problems, and suggests improvements. Although it is not an HTTP conformance tester, it can find quite a few HTTP-related issues.
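The tip about renaming a resource when it changes is often applied by putting a short hash of the content into the file name, so the name changes whenever the content does. The sketch below is one possible way to do this; the fnv1a and fingerprint helpers are hypothetical, and the FNV-1a hash is used only because it needs no libraries, not because any particular build tool uses it.

```javascript
// 32-bit FNV-1a hash of a string, returned as 8 hexadecimal characters.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Insert the content hash before the file extension: app.css -> app.<hash>.css
function fingerprint(fileName, content) {
  const hash = fnv1a(content);
  const dot = fileName.lastIndexOf(".");
  return dot === -1
    ? `${fileName}.${hash}`
    : `${fileName.slice(0, dot)}.${hash}${fileName.slice(dot)}`;
}

console.log(fingerprint("app.css", "body { margin: 0; }"));
console.log(fingerprint("app.css", "body { margin: 8px; }"));
```

The renamed file can then be served with a large max-age, while the page that links to it keeps a shorter expiration time so it picks up the new name quickly.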

User behavior and caching

Some user behaviors will affect the browser's cache, as follows:

Complete flow chart
