HTTP caching mechanism

Web caching can be roughly divided into: database caching, server-side caching (proxy server caching, CDN caching), and browser caching.

The browser cache itself covers several things: the HTTP cache, IndexedDB, cookies, localStorage, and so on. Here we discuss only HTTP caching.

Before we get into HTTP caching in detail, let's clarify a few terms:

  • Cache hit ratio: The ratio of the number of requests served from the cache to the total number of requests. The higher, the better.
  • Expired content: Content that has passed its configured expiration time and is marked as "stale". Stale content usually cannot be used to answer a client request; the cache must ask the origin server for new content or verify that the cached copy is still valid.
  • Verification: Checking whether expired content in the cache is still valid; if the check passes, its expiration time is refreshed.
  • Invalidation: Removing content from the cache. When content changes, the outdated copy must be removed.

Browser caching is mainly the caching mechanism defined by the HTTP protocol. There are also HTML meta tags such as

<META HTTP-EQUIV="Pragma" CONTENT="no-cache">

which ask the browser not to cache the current page. However, proxy servers do not parse HTML content, so in practice caching is controlled through HTTP header fields.

 

Browser Cache Classification

Browser cache is divided into strong cache and negotiation cache. The simple process of browser loading a page is as follows:

  1. The browser first checks, based on the HTTP header information of the resource, whether the strong cache is hit. If it is, the resource is loaded directly from the cache and no request is sent to the server.
  2. If the strong cache is missed, the browser sends the resource request to the server. The server determines whether the browser's local cached copy is still valid (the negotiation cache). If it is, the server does not return the resource content, and the browser continues to load the resource from its cache.
  3. If the negotiation cache is also missed, the server returns the complete resource to the browser, the browser loads the new resource, and the cache is updated.
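The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: CachedEntry, load_resource, and the server_etag/server_body arguments are hypothetical names, and the server's answer is simulated with plain arguments instead of a real network round trip.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CachedEntry:
    """A locally cached response (hypothetical structure for illustration)."""
    body: bytes
    stored_at: float       # client clock, in seconds, when the response was stored
    max_age: float         # freshness lifetime in seconds
    etag: Optional[str]    # validator returned by the server, if any


def load_resource(entry: Optional[CachedEntry], now: float,
                  server_etag: str, server_body: bytes) -> Tuple[str, bytes]:
    """Return (how the resource was obtained, body), following the three steps."""
    # Step 1: strong cache. A fresh entry is used without contacting the server.
    if entry is not None and now - entry.stored_at < entry.max_age:
        return "strong cache", entry.body
    # Step 2: negotiation cache. The validator still matches, so the server
    # would answer 304 and the cached body is reused.
    if entry is not None and entry.etag is not None and entry.etag == server_etag:
        return "304 negotiation cache", entry.body
    # Step 3: both missed. The server sends the complete resource (200).
    return "200 full response", server_body
```

For example, an entry stored at time 0 with max_age=60 is served from the strong cache at time 30, and is revalidated at time 90.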

 

Strong cache

When the strong cache is hit, the browser does not send a request to the server. In Chrome's developer tools the HTTP status code is shown as 200, and the Size column shows (from cache); newer Chrome versions show (memory cache) or (disk cache) instead.

Strong caching is controlled by the Expires and Cache-Control fields in the HTTP response header, which indicate how long a resource may be cached.

Expires

Expires specifies the point in time at which the resource expires, as an absolute time set by the server. Conceptually, Expires = request time + max-age. Expires is a response header field of the web server; it tells the browser that, until the expiration time, it may take the data directly from the browser cache without requesting it again. It is often combined with Last-Modified so that an expired copy can be revalidated. If Cache-Control is also present, Cache-Control has the higher priority, as described in the Cache-Control section below.

 

This field returns a time such as Expires: Thu, 31 Dec 2037 23:59:59 GMT. This is the expiration time of the resource: it stays valid, and the cache is hit, until 23:59:59 GMT on December 31, 2037. This method has an obvious disadvantage. Because the expiration time is an absolute time, the browser compares it with its own local clock. If the client's local time is changed, or the clocks of server and client drift far apart, the cache may expire too early or too late. That is why Cache-Control was introduced.
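To make the clock problem concrete, here is a minimal Python sketch. The function name is_fresh_by_expires and the two sample clock values are assumptions for illustration; only the Expires value is taken from the text above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh_by_expires(expires_header: str, client_now: datetime) -> bool:
    """True while the client's own clock is still before the Expires time."""
    expires_at = parsedate_to_datetime(expires_header)  # timezone-aware (GMT)
    return client_now < expires_at


expires = "Thu, 31 Dec 2037 23:59:59 GMT"

# A client whose clock is correct, about twelve hours before expiry.
correct_clock = datetime(2037, 12, 31, 12, 0, 0, tzinfo=timezone.utc)
# The same moment on a client whose clock runs one day ahead (assumed skew).
skewed_clock = correct_clock + timedelta(days=1)

print(is_fresh_by_expires(expires, correct_clock))  # True
print(is_fresh_by_expires(expires, skewed_clock))   # False
```

The resource is identical in both cases; only the client clock differs, and the second client treats the copy as expired a day early.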

Cache-Control

Cache-Control uses a relative time, such as Cache-Control: max-age=3600, which means that the resource is valid for 3600 seconds. Because the time is relative and the elapsed time is measured on the client's own clock, a time deviation between server and client does not cause problems.
Cache-Control and Expires can both be enabled in the server configuration, or only one of them. When both are enabled, Cache-Control has the higher priority.
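The priority rule can be sketched as follows. This is a simplified illustration, not how any particular browser implements it: is_fresh and its parameters are hypothetical names, only the max-age directive is considered, and both timestamps are assumed to be timezone-aware values taken from the client's clock.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def is_fresh(stored_at: datetime, now: datetime,
             cache_control: Optional[str], expires: Optional[str]) -> bool:
    """Decide freshness; max-age (relative) wins over Expires (absolute)."""
    if cache_control:
        match = re.search(r"\bmax-age=(\d+)", cache_control, re.IGNORECASE)
        if match:
            # Relative time: only the age measured on the client clock matters.
            age_seconds = (now - stored_at).total_seconds()
            return age_seconds < int(match.group(1))
    if expires:
        # Fallback: absolute time compared against the client clock.
        return now < parsedate_to_datetime(expires)
    return False  # no explicit freshness information in this sketch


stored = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
half_hour_later = datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc)
# max-age=3600 wins even though the Expires value is long in the past.
print(is_fresh(stored, half_hour_later, "max-age=3600",
               "Thu, 01 Jan 1970 00:00:00 GMT"))  # True
```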

Cache-Control can be composed of multiple fields, which mainly have the following values:

1.  max-age  specifies how long the cache is valid, in seconds. For example, Cache-Control: max-age=31536000 means that the cache is valid for 31536000 / (24 * 60 * 60) = 365 days. When this resource is accessed for the first time, the server may also return an Expires field whose expiration time is one year later.

 

If the cache is not disabled and the valid time has not been exceeded, accessing the resource again hits the cache: the browser fetches it directly from the browser cache instead of requesting it from the server.

2.  s-maxage  works like max-age and overrides both max-age and Expires, but it applies only to shared caches and is ignored by private caches.

3.  public  indicates that the response can be cached by any object (client sending the request, proxy server, etc.).

4.  private  indicates that the response can only be cached for a single user (this may be an operating system user or a browser user); it is not shared and cannot be cached by a proxy server.

5.  no-cache  forces every cache that has stored the response to send a request with a validator to the server before using the cached data. It does not literally mean "do not cache".

6.  no-store  prohibits caching altogether; every request must fetch the data from the server again.

7.  must-revalidate  specifies that once the cached page has expired, the cache must revalidate it with the server before using it again. This directive is not commonly used and will not be discussed further.
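As a rough illustration of how these directives combine, the sketch below parses a Cache-Control value and derives how long a response may be reused without contacting the server. The function names parse_cache_control and freshness_lifetime are hypothetical, and the handling is deliberately simplified: no-store and no-cache are both reduced to "0 seconds", and must-revalidate is not modeled.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def freshness_lifetime(header: str, shared_cache: bool = False) -> Optional[int]:
    """Seconds the response may be reused without revalidation (None if unspecified)."""
    d = parse_cache_control(header)
    if "no-store" in d or "no-cache" in d:
        return 0  # simplified: never reuse without asking the server
    if shared_cache and "private" in d:
        return 0  # a proxy must not reuse a private response
    if shared_cache and d.get("s-maxage"):
        return int(d["s-maxage"])  # overrides max-age in shared caches
    if d.get("max-age"):
        return int(d["max-age"])
    return None


seconds = freshness_lifetime("max-age=31536000")
print(seconds // (24 * 60 * 60))  # 365 (days)
```

With "public, max-age=60, s-maxage=600", a browser (private cache) gets 60 seconds while a proxy (shared_cache=True) gets 600.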

Negotiation cache

If the strong cache is not hit, the browser sends the request to the server. The server determines whether the negotiation cache is hit according to Last-Modified/If-Modified-Since or ETag/If-None-Match in the HTTP headers. If it is hit, the HTTP status code is 304 and the browser loads the resource from its cache.

Last-Modified/If-Modified-Since

When the browser requests a resource for the first time, the server adds Last-Modified to the response header. Last-Modified is a timestamp marking when the resource was last modified, such as Last-Modified: Thu, 31 Dec 2037 23:59:59 GMT.

When the browser requests the resource again, the request header contains If-Modified-Since, whose value is the Last-Modified value received earlier. After the server receives If-Modified-Since, it determines whether the cache is hit according to the last modification time of the resource.

If the cache is hit, the server returns HTTP 304 without the resource content, and usually without Last-Modified. Because both timestamps being compared come from the server, a time gap between client and server does not cause problems. But judging whether a resource changed by its last modification time is not always accurate (the last modification time can stay the same even though the resource changed). That is why ETag/If-None-Match appeared.
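A server-side check of this kind might look like the following sketch. The function name status_for_if_modified_since is hypothetical, and the resource's modification time is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def status_for_if_modified_since(last_modified: datetime,
                                 if_modified_since: Optional[str]) -> int:
    """Return 304 if the resource is unchanged since the client's copy, else 200."""
    if if_modified_since is None:
        return 200  # first request: the client sent no validator
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop sub-second precision.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


mtime = datetime(2037, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
header = format_datetime(mtime, usegmt=True)  # value the server sent as Last-Modified
print(status_for_if_modified_since(mtime, header))  # 304
```

Because of the truncation to whole seconds, two modifications within the same second are indistinguishable here, which is exactly the inaccuracy mentioned above.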

ETag/If-None-Match

Unlike Last-Modified/If-Modified-Since, ETag/If-None-Match uses a check code (ETag: entity tag). An ETag identifies a specific version of a resource, so a change to the resource leads to a change in its ETag. A changed ETag value therefore indicates that the resource has been modified. The server decides whether the cache is hit by comparing the If-None-Match value sent by the browser with the current ETag of the resource.
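The exchange can be sketched as below. The names make_etag and respond are hypothetical, and deriving the ETag from a SHA-256 hash of the body is just one possible scheme chosen for the example; HTTP does not prescribe how the value is generated.

```python
import hashlib
from typing import Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive an ETag from the content (assumed scheme for this example)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, bytes, str]:
    """Return (status, body, ETag); the body is empty on a 304."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so a leading W/ prefix is ignored.
        candidates = [t[2:] if t.startswith("W/") else t for t in candidates]
        if "*" in candidates or etag in candidates:
            return 304, b"", etag  # client copy is current: send no content
    return 200, body, etag


status, body, tag = respond(b"hello", None)   # first request: 200 plus ETag
print(respond(b"hello", tag)[0])              # 304, resource unchanged
print(respond(b"hello!", tag)[0])             # 200, resource changed
```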

 

 

 ETag extension description

We have high hopes for ETag: we want it to produce a unique value for each URL, and we want it to change whenever the resource changes. So how is the mysterious ETag generated? Taking Apache as an example, ETag generation depends on the following factors:

  1. The i-node number of the file. (This i-node has nothing to do with other things named iNode.) It is the number that Linux/Unix uses to identify a file; the file is identified by this number, not by its filename. You can see it with the command 'ls -i'.
  2. The last modified time of the file
  3. The file size

When generating an ETag, one or several of these factors are combined using a collision-resistant hash function. In theory ETags can therefore still repeat, but the probability is small enough to be ignored.
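To show how the three factors can feed into an ETag, here is a small Python sketch. It is not Apache's actual algorithm: file_etag and its use_inode parameter are hypothetical, and the sketch simply joins the values in hexadecimal instead of hashing them.

```python
import os


def file_etag(path: str, use_inode: bool = True) -> str:
    """Build an ETag-like string from a file's inode, size, and modification time."""
    st = os.stat(path)
    parts = []
    if use_inode:
        # The inode differs between machines, so servers behind a load
        # balancer may prefer to leave it out.
        parts.append(format(st.st_ino, "x"))
    parts.append(format(st.st_size, "x"))
    parts.append(format(st.st_mtime_ns, "x"))
    return '"' + "-".join(parts) + '"'
```

Calling file_etag twice on an unchanged file yields the same value, while appending to the file changes the size component and therefore the result.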

Why was ETag introduced after Last-Modified?

You might think that Last-Modified is enough to let the browser know whether the locally cached copy is fresh enough, so why do we need ETag (entity tag)? ETag was introduced in HTTP 1.1 mainly to solve several problems that are difficult to solve with Last-Modified:

1. The modification time recorded by Last-Modified is only accurate to the second. If a file is modified several times within one second, Last-Modified cannot accurately reflect the modification.

2. If a file is regenerated periodically, its content may be unchanged while its Last-Modified value changes, so the cached copy cannot be used.

3. The server may not obtain the file modification time accurately, or its clock may be inconsistent with that of a proxy server.

An ETag is a unique identifier for the resource on the server side, generated automatically by the server or by the developer, and it allows more precise cache control. Last-Modified and ETag can be used together. When a request carries both If-None-Match and If-Modified-Since, the HTTP specification gives the ETag comparison precedence: the server evaluates If-None-Match and ignores If-Modified-Since, then decides whether to return 304.

User behavior and caching

Browser caching behavior also depends on what the user does:

User action                 Expires/Cache-Control    Last-Modified/ETag
Enter in the address bar    valid                    valid
Page link jump              valid                    valid
New window                  valid                    valid
Forward, backward           valid                    valid
F5 refresh                  invalid                  valid
Ctrl+F5 refresh             invalid                  invalid

Summary:

Browser first request: [figure not preserved]

When the browser requests again: [figure not preserved]
