Understanding front-end performance optimization (a detailed explanation, 2023)

      We often run into the term "performance optimization" in front-end work and interviews. It seems easy to talk about, since everyone can say something about it. But if you want to produce a concrete solution when you hit a performance bottleneck in a real scenario, or impress an interviewer, you cannot just "say whatever comes to mind" or give a rough outline; you need a systematic, in-depth knowledge map that covers every angle. This article can also be regarded as a summary of my personal front-end knowledge, because "performance optimization" is not just "optimization". What does that mean? Before implementing an optimization plan, you must first know why you are optimizing in this particular way and what the goal is. That requires a solid understanding of frameworks, JS, CSS, the browser, the JS engine, the network, and so on. So performance optimization really covers a great deal of front-end knowledge, arguably most of it.

      First, let's talk about the essence of front-end performance. The front end is a networked application: its performance is determined by how efficiently it runs and, because everything arrives over the network, by how efficient that network is. So I think the essence of front-end performance is network performance plus runtime performance. The two major categories in a front-end performance optimization system are therefore the network and the runtime; subdividing each of these themes is enough to weave a huge front-end knowledge graph.

Network level

      If we compare the network connection to a water pipe, opening a page is like the other side holding a glass of water that you want poured into your own glass. To make it faster there are three ways: 1. make the pipe's flow larger and faster; 2. have the other side put less water in the cup; 3. already have the water in your own cup so you don't need theirs. The pipe's flow is your network bandwidth, protocol tuning, and other factors that affect network speed; less water in the cup means compression, code splitting, lazy loading and other means of shrinking or reducing requests; the last one is caching.

      Let's talk about network speed first. Network speed is not determined solely by the user's carrier; by familiarizing yourself with the principles of the network protocols and tuning them, you can also improve transmission efficiency.

      A computer network is theoretically the OSI seven-layer model, but in practice it can be seen as five layers (or a four-layer model): the physical layer, data link layer, network layer, transport layer, and application layer. Each layer encapsulates, unwraps, and parses its own protocol and performs its own task, like palace maids dressing and undressing the emperor layer by layer: you handle the coat and I handle the underwear, each performing their own duties. As front-end developers we mainly care about the application layer and the transport layer, so let's start with the application-layer HTTP protocol that we deal with every day.

Optimization of http protocol

1. Under HTTP/1.1, it is necessary to avoid reaching the maximum concurrent limit of browser requests for the same domain name (usually 6 for Chrome)

  • When there are a large number of page resource requests, you can prepare multiple domain names and use different domain name requests to bypass the maximum concurrency limit.
  • Multiple small icons can be merged into one large image, so that multiple image resources require only one request. The front end displays the corresponding icons (also called sprite images) through the background-position style of CSS.

2. Reduce the size of HTTP headers

  • For example, requests to the same domain automatically carry cookies, which is wasted bandwidth when authentication is not needed. Such resources should therefore be served from a different, cookie-free domain rather than the site's main domain.

3. Make full use of HTTP cache. Caching can directly eliminate requests and greatly improve network performance.

  • The browser can use cache-control request header values such as no-cache and max-stale to control whether the strong cache is used, whether to go through the negotiation cache, and whether an expired cache entry may still be used.
  • The server uses cache-control response header values such as max-age, public, and stale-while-revalidate to control how long the strong cache lasts, whether proxy servers may cache the resource, and for how long after expiry a stale copy may be served while it is refreshed in the background.

4. Upgrading to HTTP/2.0 or higher can significantly improve network performance. (Must use TLS, i.e. https)

5. Optimize HTTPS

       HTTPS performance overhead comes mainly from two places:

  • The first step is the TLS protocol handshake process;
  • The second step is the transmission of symmetrically encrypted messages after the handshake.

For the second step, the current mainstream symmetric encryption algorithms AES and ChaCha20 have good performance, and some CPU manufacturers have also made hardware-level optimizations for them, so the encryption performance consumption in this step can be said to be very small.

In the first step, the TLS handshake not only adds network delay (it can take up to 2 RTTs of round-trip time), but some steps in the handshake also cost CPU, for example:

      If the ECDHE key-agreement algorithm is used, both the client and the server need to generate temporary elliptic-curve public/private key pairs during the handshake; when the client verifies the certificate, it accesses the CA server to fetch the CRL or an OCSP response in order to check whether the server's certificate has been revoked; then both parties compute the Pre-Master secret, from which the symmetric session key is derived. To see where these steps sit in the overall TLS handshake, refer to this picture:

TLS handshake

HTTPS can be optimized using the following means:

  • Hardware optimization: The server uses a CPU that supports the AES-NI instruction set
  • Software optimization: upgrade the Linux kernel and the TLS version. TLS 1.3 greatly optimizes the handshake, requiring only 1 RTT, and supports forward secrecy (meaning that even if the key is cracked now or in the future, previously intercepted messages remain secure).
  • Certificate optimization: OCSP stapling. Normally the browser has to check with the CA whether the certificate has been revoked; instead, the server can periodically query the CA for the certificate status, obtain a timestamped, signed response, and cache it. When a client initiates a connection, the server sends this response directly during the TLS handshake, so the browser does not need to contact the CA itself.
  • Session reuse 1: Session ID. Both parties keep the session in memory. The next time a connection is established, the hello message carries the Session ID; the server looks it up in memory, and if found, uses the saved session key to restore the session state directly, skipping the rest of the handshake. For security, session keys in memory expire periodically. This has two disadvantages: 1. the server must store the session key of every client, so memory usage grows with the number of clients; 2. websites are usually served by multiple load-balanced servers, so the client may not hit the server it visited last time, in which case it still has to go through the full TLS handshake.
  • Session reuse 2: Session Ticket. When the client and server first establish a connection, the server encrypts the "session key" and sends it to the client as a Ticket, which the client saves. This is similar to the token scheme used to verify user identity in web development. When the client reconnects, it sends the Ticket; if the server can decrypt it, it recovers the previous session key, checks the validity period, and if everything is fine the session is restored and encrypted communication begins directly. Because only the server can encrypt and decrypt this Ticket, being able to decrypt it proves there is no forgery. For server clusters, make sure the key used to encrypt the "session key" is the same on every server, so that the client can bring the Ticket to any server and still restore the session.

Neither Session ID nor Session Ticket provides forward secrecy, because once the key that encrypts the "session key" is cracked or leaked by the server, previously captured ciphertext can be decrypted. They are also vulnerable to replay attacks: if a middleman intercepts a POST request message, he cannot decrypt it, but he can replay this non-idempotent message to the server, because with a valid Ticket the server will simply resume the HTTPS session. To reduce the harm of replay attacks, set a reasonable expiration time for the encrypted session key.


The following is a detailed introduction to http knowledge points.

HTTP/0.9

The initial version was very simple, with the goal of quick adoption. Its only function was a simple GET of an HTML page. The request message looked like this:

GET /index.html

 HTTP/1.0

With the development of the Internet, HTTP needed to do more, so this version gained the familiar HTTP headers, status codes, the GET/POST/HEAD request methods, caching, and so on. It can also transmit binary files such as pictures and videos.

The disadvantage of this version is that the TCP connection is closed after each request, and the next HTTP request has to re-establish the TCP connection. Therefore some browsers added the non-standard Connection: keep-alive header; the server replies with the same header, and through this convention TCP keeps a long-lived connection that subsequent HTTP requests can reuse until one side actively closes it.

HTTP/1.1

Version 1.1 is currently widely used. In this version, tcp long connection is used by default. If you want to close it, you need to actively add the header Connection: close.

In addition, it has a pipelining mechanism: the client can send multiple HTTP requests back-to-back on the same TCP connection without waiting for each response. Previously, only one HTTP request could be in flight on a TCP connection at a time; only after its response arrived could the next request be sent. Although HTTP/1.1 can send multiple requests continuously thanks to pipelining, the server must still return responses in FIFO (first in, first out) order, so if the first response is slow, the later ones are still blocked behind it. When receiving multiple consecutive responses, the browser separates them by Content-Length.

In addition, chunked transfer encoding was added, replacing the buffered form with a stream. For a video, for example, you no longer need to read it completely into memory before sending; you can send each small piece as soon as it has been read. It is turned on with the Transfer-Encoding: chunked header. Each chunk is preceded by a hexadecimal number giving its length; a length of 0 means the transfer is finished. In scenarios such as large file transfers or file processing, this feature improves efficiency and reduces memory usage.
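As a hedged illustration, here is a minimal sketch of streaming a file with Node's built-in http and fs modules; when a response body is streamed without a Content-Length, Node falls back to Transfer-Encoding: chunked automatically. The file name video.mp4 and the port are placeholders, not part of the original article:

import http from "node:http";
import fs from "node:fs";

http
  .createServer((req, res) => {
    // No Content-Length is set, so Node applies Transfer-Encoding: chunked
    res.writeHead(200, { "Content-Type": "video/mp4" });
    // Send each piece as soon as it is read instead of buffering the whole file in memory
    fs.createReadStream("video.mp4").pipe(res);
  })
  .listen(8080);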

This version has the following disadvantages:

1. Head-of-line blocking. A request must receive its response before the exchange is complete and the next one can proceed, so if an earlier request is slow it delays the ones behind it. In addition, the browser caps the number of concurrent HTTP requests per domain; beyond that limit, requests must wait for earlier ones to finish.
2. HTTP header redundancy. The request headers on a page are often nearly identical, yet this text must be carried on every request, wasting network resources.

In fact, the shortcomings of HTTP/1.1 essentially stem from its original positioning as a plain-text protocol. If you want out-of-order delivery, you either have to modify the protocol itself, for example adding a unique identifier to each request/response and parsing the text at the other end to restore the order, or you have to wrap the HTTP protocol once more, converting the text into binary data with additional framing. Following the open-closed principle, adding is better than modifying, so the latter solution is clearly more reasonable. HTTP/2.0 therefore splits the original data into binary frames for later processing, which amounts to adding steps on top of the original design; the core of HTTP itself has not changed.

HTTP/2.0

The improvements include multiplexing, which fixes HTTP/1.1's long-standing head-of-line blocking problem at the HTTP level and allows request priorities to be set, as well as a header compression algorithm (HPACK). In addition, HTTP/2 packages and transmits data between the client and the server in binary rather than plain text.

Frames, messages, streams, and TCP connections

We can think of version 2.0 as adding a binary framing layer beneath HTTP. A message (a complete request or response) is divided into many frames; each frame contains a type, length, flags, a stream identifier, and the frame payload. The abstract concept of a stream is also added: each frame's stream identifier says which stream it belongs to. Because HTTP/2.0 can send frames out of order without waiting, the sender and receiver reassemble the data according to the stream identifier. To prevent the two ends from creating conflicting duplicate stream IDs, streams initiated by the client have odd IDs and streams initiated by the server have even IDs. The content of the original protocol is unaffected: the header section from HTTP/1.1 is encapsulated into a HEADERS frame and the request body into DATA frames, and multiple requests share a single TCP channel. In practice this has been shown to speed up page loads by 11.81% to 47.7% compared with HTTP/1.1. Optimizations such as domain sharding and sprite images are no longer needed with HTTP/2.0.
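As a hedged illustration of "multiple requests over one TCP channel", here is a minimal sketch using Node's built-in http2 module. Browsers only speak HTTP/2 over TLS, so a certificate is needed; the key/cert file paths and the port are placeholder assumptions:

import http2 from "node:http2";
import fs from "node:fs";

// Browsers require TLS for HTTP/2, so createSecureServer is used
const server = http2.createSecureServer({
  key: fs.readFileSync("server-key.pem"),   // placeholder paths
  cert: fs.readFileSync("server-cert.pem"),
});

server.on("stream", (stream, headers) => {
  // Every request arrives as a stream multiplexed over a single TCP connection
  stream.respond({ ":status": 200, "content-type": "text/plain" });
  stream.end(`you requested ${headers[":path"]}`);
});

server.listen(8443);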

HPACK algorithm

The HPACK algorithm is a newly introduced algorithm in HTTP/2 and is used to compress HTTP headers. The principle is:

  • According to Appendix A of RFC 7541, the client and server maintain a common static dictionary (Static Table) containing codes for common header names and for common header name/value combinations;
  • the client and server maintain, on a first-in-first-out basis, a common dynamic dictionary (Dynamic Table) to which entries can be added dynamically;
  • according to Appendix B of RFC 7541, the client and server support Huffman coding based on a shared static Huffman code table.

server push     

In the past, the browser had to actively initiate requests to get data from the server. That means adding extra request scripts to the site and waiting for those JS resources to load before they can run, which delays the request timing and adds more requests. HTTP/2 supports server-initiated push, which does not require the browser to actively send a request, saving round trips and improving the development experience. On the front end you can listen for pushed events from the server through EventSource.
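For the EventSource (Server-Sent Events) API mentioned above, a minimal sketch looks like this; the /events endpoint is an assumed example and would have to be implemented by the server as a text/event-stream response:

// Subscribe to a server-sent event stream (the /events URL is a placeholder)
const source = new EventSource("/events");

source.onmessage = (event) => {
  // Each message pushed by the server arrives here without the page polling
  console.log("server pushed:", event.data);
};

source.onerror = () => {
  // EventSource reconnects automatically; log the interruption for visibility
  console.log("connection lost, retrying...");
};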

 

HTTP/3.0

HTTP/2.0 has made a lot of optimizations compared to its predecessor, such as multiplexing, header compression, etc., but because the underlying layer is based on TCP, some pain points are difficult to solve.

head-of-line blocking

HTTP runs on top of TCP. Although binary framing ensures that multiple requests at the HTTP level do not block one another, as the TCP section of this article explains, TCP itself has head-of-line blocking and retransmission: if an earlier packet is lost, later data cannot be delivered to the application until the lost packet is retransmitted. So HTTP/2.0 only solves head-of-line blocking at the HTTP level; the overall network link can still be blocked. It would be great if a new protocol could transmit faster in modern network environments.

TCP, TLS handshake latency

The TCP three-way handshake plus the TLS 1.2 four-message handshake add up to about 3 RTTs of delay before an actual HTTP request can be sent. On top of that, TCP's congestion control starts with slow start, which slows things down further at the beginning.

Switching networks causes reconnection

We know that a TCP connection is uniquely identified by the IPs and ports of both ends. Mobile networks and travel are now so pervasive that your phone automatically joins WiFi when you enter the office or get home, and on subways and high-speed trains the phone may switch base stations within seconds. All of these cause the IP to change, invalidating the previous TCP connection. What the user sees is a half-loaded webpage that suddenly stops loading, or a half-buffered video that never finishes buffering.

QUIC protocol

The above problems are inherent to TCP. To solve them, we can only change the protocol. http/3.0 uses the QUIC protocol. A completely new protocol requires hardware support, which will inevitably take a long time to popularize, so QUIC is built on an existing protocol UDP.

The QUIC protocol has many advantages, such as:

No head-of-line blocking


The QUIC protocol also has the concept of Stream and multiplexing similar to HTTP/2. It can also transmit multiple Streams concurrently on the same connection. A Stream can be considered as an HTTP request.

Since QUIC's underlying transport is UDP, UDP itself does not care about packet order, nor does it care whether packets are lost.

However, the QUIC protocol still needs to ensure the reliability of data packets. Each data packet is uniquely identified by a sequence number. When a packet in a flow is lost, even if other packets in the flow arrive, the data cannot be read by HTTP/3. The data will not be handed over to HTTP/3 until QUIC retransmits the lost packet.

As long as the data packet of a certain flow is completely received, HTTP/3 can read the data of this flow. This is different from HTTP/2, where if a packet is lost in one stream, other streams will be affected.

Therefore, there is no dependence between multiple Streams on the QUIC connection and they are all independent. If a certain stream loses packets, it will only affect that stream and other streams will not be affected.

Faster connection establishment

For the HTTP/1 and HTTP/2 protocols, TCP and TLS are layered: TCP belongs to the transport layer implemented by the kernel, while TLS is implemented in user space by the OpenSSL library (conceptually at the presentation layer). They are therefore hard to merge and must handshake one after the other: first the TCP handshake, then the TLS handshake.

Although HTTP/3 also requires a QUIC handshake before transmitting data, this handshake takes only 1 RTT. Its purpose is to confirm both parties' "connection IDs"; connection migration (for example, migrating the connection when the IP changes) is implemented based on this connection ID.

The QUIC protocol of HTTP/3 is not layered with TLS, but QUIC contains TLS internally. It will carry the "record" in TLS in its own frame. In addition, QUIC uses TLS 1.3, so only one RTT can complete the connection establishment and key negotiation "simultaneously". Even during the second connection, the application data packet can be sent together with the QUIC handshake information (connection information + TLS information) to achieve the effect of 0-RTT.

As shown in the right part of the figure below, when the HTTP/3 session is restored, the payload data is sent together with the first packet, which can achieve 0-RTT:

 

Connection migration

When a mobile device's network switches from 4G to WiFi, the IP address changes, so the connection must be torn down and re-established. Re-establishing it costs the delay of the TCP three-way handshake and the TLS handshake, plus the slow ramp-up of TCP slow start, which to the user feels like the network suddenly stalling, so connection migration on TCP is very expensive. On a high-speed train your IP may change continuously, causing the TCP connection to reconnect over and over.

The QUIC protocol does not use the four-tuple to "bind" a connection; it uses connection IDs to mark the two endpoints, and the client and server each choose a set of IDs to identify themselves. So even if a mobile device changes network and its IP address changes, as long as the context (connection ID, TLS keys, etc.) is retained, the original connection can be reused "seamlessly", eliminating the cost of reconnection with no perceptible lag. This is the connection migration feature.

Simplified frame structure, QPACK optimized header compression

HTTP/3 uses a binary frame structure like HTTP/2. The difference is that HTTP/2's binary frames had to define the stream themselves, whereas HTTP/3 no longer needs to and uses QUIC's streams directly, so HTTP/3's frame structure is also simpler.

HTTP/3 frames

  According to different frame types, they are generally divided into two categories: data frames and control frames. Headers frames (HTTP headers) and DATA frames (HTTP packet bodies) belong to data frames.

HTTP/3 has also been upgraded in terms of header compression algorithm, which has been upgraded to QPACK. Similar to the HPACK encoding method in HTTP/2, QPACK in HTTP/3 also uses static table, dynamic table and Huffman encoding.

Regarding the changes in the static table, the static table of HPACK in HTTP/2 has only 61 entries, while the static table of QPACK in HTTP/3 has been expanded to 91 entries.

The Huffman encoding of HTTP/2 and HTTP/3 is not much different, but the dynamic table encoding and decoding methods are different.

The so-called dynamic table, after the first request-response, both parties will update the Header items (such as some customized headers) not included in the static table into their respective dynamic tables, and then only use 1 number to represent them in subsequent transmissions. Then the other party can look up the corresponding data from the dynamic table based on this number, without having to transmit long data every time, which greatly improves the coding efficiency.

As you can see, the dynamic table is sequential. If the packet carrying the first occurrence of a header is lost and a later request uses that header again, the sender assumes the receiver has already stored it in the dynamic table and compresses the header accordingly. But the receiver has not built that dynamic-table entry and cannot decode the HPACK-compressed header, so decoding of subsequent requests must block until the packet lost from the first request is retransmitted.

HTTP/3's QPACK solves this problem, but how does it solve it?

QUIC has two special unidirectional streams (only one end of a unidirectional stream can send messages; bidirectional streams are used to carry HTTP messages). The two unidirectional streams are used as follows:

One is the QPACK Encoder Stream, used to send a dictionary entry (key-value) to the other side; for example, when the client meets an HTTP request header not in the static table, it can send the entry over this stream. The other is the QPACK Decoder Stream, used to reply to the other side, telling it that the entry just sent has been added to the local dynamic table and can be used for encoding from now on. These two special unidirectional streams synchronize the dynamic tables of the two sides: the encoder only starts using a dynamic-table entry to encode HTTP headers after receiving the decoder's update confirmation. If a dynamic-table update message is lost, the only consequence is that some headers are not compressed; HTTP requests are not blocked.

Detailed explanation of HTTP caching

If a network resource does not need to be requested and is obtained directly from the local cache, it is naturally the fastest. The cache mechanism is defined in the http protocol, which is divided into local cache (also called strong cache) and cache that needs to be verified through requests (also called negotiation cache).

Local cache (strong cache)

In HTTP/1.0 the Expires response header indicated the expiry time of the returned resource; within that time the browser could use the cache directly without re-requesting. Since HTTP/1.1 this was replaced by the Cache-Control response header, which supports many more caching requirements; its max-age means the resource expires N seconds after the response was generated. Note that max-age is not measured from when the browser receives the response: it is the time elapsed since the response was generated on the origin server and has nothing to do with the browser's clock. So if another cache server on the path has stored the response for 100 seconds (indicated by the Age response header), the browser cache will deduct those 100 seconds from its expiry time. Once the cache expires (ignoring the effects of stale-while-revalidate, max-stale, and so on), the browser issues a conditional request to check whether the resource has been updated (also called the negotiation cache).

Conditional requests (negotiation cache)

The request carries the If-Modified-Since and If-None-Match headers, which are respectively the Last-Modified and ETag values from the previous response. Last-Modified is the time the resource was last modified, with one-second precision. ETag is an identifier for a specific version of the resource (for example, a hash of the content can serve as an ETag). If If-None-Match or If-Modified-Since shows no change, the server returns a 304 response and the browser, concluding the resource has not been updated, reuses its local cache. Because Last-Modified only has one-second precision, changes that happen within the same second cannot be detected reliably, so ETag takes precedence over Last-Modified.

If Cache-Control is set to no-cache, the strong cache is skipped and the negotiation cache is used directly, i.e. the same as max-age=0. If no-store is set, no cache is used at all.

That, in short, is the browser's caching strategy for requests. As you can see, caching is determined by response headers and request headers; during development these are usually set automatically by the gateway and the browser, but if you have specific needs you can use more of Cache-Control's features in a customized way.
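To make the flow above concrete, here is a minimal, hedged sketch of a Node http handler that sets a strong cache and falls back to the negotiation cache with an ETag. The one-hour max-age, the md5 hashing choice, and the port are assumptions for illustration, not recommendations from the original article:

import http from "node:http";
import crypto from "node:crypto";

const body = JSON.stringify({ hello: "world" });
// Derive the ETag from the content so it changes whenever the content changes
const etag = `"${crypto.createHash("md5").update(body).digest("hex")}"`;

http
  .createServer((req, res) => {
    if (req.headers["if-none-match"] === etag) {
      // Negotiation cache hit: tell the browser to keep using its local copy
      res.writeHead(304);
      res.end();
      return;
    }
    res.writeHead(200, {
      "Content-Type": "application/json",
      // Strong cache for one hour; afterwards the browser revalidates with If-None-Match
      "Cache-Control": "max-age=3600",
      ETag: etag,
    });
    res.end(body);
  })
  .listen(8080);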

Complete Cache-Control features

Cache-Control also has more detailed cache control capabilities. For the complete meaning of response headers and request headers, see the table below.

Response headers

Request headers (only those not already covered by the response headers are listed)

| Directive | Meaning |
| --- | --- |
| max-stale | An expired cache entry may still be used, as long as it has been stale for no more than max-stale seconds |
| min-fresh | The cache must return data that will stay fresh for at least min-fresh seconds; otherwise the local cache is not used |
| only-if-cached | The browser only wants the target resource if the cache server already has it cached |
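As a hedged illustration of the request-side directives, a page can express them through fetch; the /api/articles URL and the one-hour value are placeholders, and per the Fetch spec only-if-cached additionally requires mode: "same-origin":

// Accept a cached copy even if it expired up to an hour ago (illustrative value)
fetch("/api/articles", {
  headers: { "Cache-Control": "max-stale=3600" },
});

// Only answer from the HTTP cache; the request fails if nothing is cached
fetch("/api/articles", {
  cache: "only-if-cached",
  mode: "same-origin",
});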

Optimization of TCP protocol

You may need this when writing Node. If not, don't worry: readers interested only in the pure front end can skip it :)

First, we will directly give optimization methods for different problems. The specific tcp principles and why these phenomena occur will be introduced in detail later.

The following tcp optimization generally occurs on the request side

1. The first request should not exceed 14KB, which makes effective use of TCP slow start. The same applies to the first bundle of a front-end page.

  • Assuming the initial TCP congestion window is 10 segments and the MSS is 1460, the resource requested first should not exceed 14600 bytes, about 14KB. That way the peer's TCP can send it in one go; otherwise it takes at least 2 rounds, costing an extra RTT (network round-trip time).

2. What should you do when TCP stalls because small packets (smaller than the MSS) are sent frequently?
This is common in games (although they generally do not use TCP) and in command-line SSH sessions.

  • Turn off Nagle's algorithm
  • Avoid delayed ack

How to optimize tcp packet loss retransmission

  • Turn on SACK via net.ipv4.tcp_sack (enabled by default)
  • Turn on D-SACK via net.ipv4.tcp_dsack (enabled by default)

The following tcp optimization generally occurs on the server side

1. The number of concurrent requests received by the server is too high or it encounters a SYN attack, causing the SYN queue to be full and unable to respond to requests.

  • Use SYN cookies
  • Reduce the number of syn ack retries
  • Increase syn queue size

2. Too many TIME-WAITs cause the available ports to be full and no more requests can be sent.

  • Use the tcp_max_tw_buckets configuration of the operating system to control the number of concurrent TIME-WAITs
  • If possible, increase the port range and IP address of the client or server

The above tcp optimization method is based on understanding the tcp mechanism and adjusting operating system parameters, which can achieve network performance optimization to a certain extent. Below we will start with the implementation mechanism of tcp, and then explain what these optimization methods do.

We all know that a connection must be established before TCP can transmit, but in fact network transmission does not require a connection: the network was designed for bursty, send-anytime traffic, abandoning the design of the telephone network. The so-called TCP connection is really just state kept on the two devices recording some information about each other; it is not a physical connection. TCP distinguishes connections by a five-tuple: the protocol plus src_ip, src_port, dst_ip, dst_port (the IPs and port numbers of both ends). In addition, four fields in the TCP segment header matter most:

  • Sequence Number (seq): the position of the first byte of this segment's data within the whole data stream, used to solve out-of-order delivery.
  • Acknowledgment Number (ack): the length of the data received this time plus the received seq; it is the next sequence number expected from the other side (the sender), used to confirm receipt and avoid packet loss going unnoticed.
  • Window, also called Advertised-Window: the sliding window, used for flow control.
  • TCP Flags: the packet type, such as SYN, FIN, ACK, etc., mainly used to drive the TCP state machine.
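Purely as an illustrative sketch (not a real parser), the fields just described can be modeled as a data structure to keep them straight:

// Illustrative model of the TCP header fields described above (not a real implementation)
interface TcpSegmentHeader {
  srcPort: number;
  dstPort: number;
  seq: number;     // position of this segment's first data byte within the stream
  ack: number;     // next byte expected from the peer = received seq + data length
  flags: { SYN?: boolean; ACK?: boolean; PSH?: boolean; FIN?: boolean; RST?: boolean };
  window: number;  // advertised receive window, used for flow control
}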

The principle part is introduced below:

TCP three-way "handshake"

The essence of the three-way handshake is to learn both parties' initial sequence numbers, MSS, window and other information, so that data can be spliced together in order even when packets arrive out of order, and to ascertain the maximum carrying capacity of the network and hardware.

The initial seq sequence number (ISN) is 32 bits and is generated by a virtual clock that increments by 1 every 4 microseconds; it wraps back to 0 after exceeding 2^32, one cycle taking about 4.55 hours. The reason each connection does not start from 0 is to avoid seq conflicts between new packets and late-arriving old packets after a connection is torn down and re-established: 4.55 hours far exceeds the Maximum Segment Lifetime (MSL), so by then the old packets no longer exist.

  • The client sends a SYN packet (flags: SYN), assuming the initial seq is x, so seq = x. The client's TCP enters the SYN_SENT state.
  • The server's TCP is initially in the LISTEN state. After receiving the SYN, it replies with a SYN-ACK packet (flags: SYN, ACK). Assuming its own initial seq is y: seq = y, ack = x + 1 (the SYN flag occupies 1 in sequence space, so the client should continue from x + 1). The server enters the SYN_RECEIVED state.
  • The client sends an ACK packet after receiving it: seq = x + 1, ack = y + 1. It then sends the actual content as a PSH packet (assume the data length is 100): seq = x + 1, ack = y + 1. seq and ack are unchanged from the pure ACK packet because the ACK flag only confirms receipt and does not itself occupy sequence space. The client enters the ESTABLISHED state.
  • The server sends an ACK packet after receiving it, seq = y + 1, ack = x + 101. The server enters the ESTABLISHED state.

The calculation of seq and ack can be compared with this packet capture picture (the picture is from the Internet, the sequence number in it is a relative sequence number)


SYN timeouts and attacks

During the three-way handshake, after the server receives the SYN and returns the SYN-ACK, the TCP connection is in a half-open intermediate state; the operating system kernel temporarily places it in the SYN queue, and only after the handshake succeeds does it move to the completed-connection (accept) queue. If the server does not receive the client's ACK, it times out and retries, by default 5 times, doubling from 1s: 1s, 2s, 4s, 8s, 16s, 32s, so after the fifth timeout a total of 63s has passed and TCP drops the connection. Some attackers exploit this by sending large numbers of SYN packets and then disappearing; the server has to wait 63 seconds before clearing each connection from the SYN queue, so the server's SYN queue fills up and it cannot continue serving. The same can happen under normal heavy concurrency. In this case we can set the following parameters under Linux:

  • tcp_syncookies: after the SYN queue is full, the kernel can generate a special Sequence Number (the cookie) from the four-tuple, a timestamp that increments every 64s, and the MSS option value, and send this cookie to the client as the seq to establish the connection directly. In this clever way tcp_syncookies encodes part of the SYN information instead of storing it locally. Careful readers will notice that tcp_syncookies seems to need only two handshake messages to establish a connection, so why not fold it into the TCP standard? Because it has drawbacks: 1. the MSS is encoded in only 3 bits, so at most 8 MSS values can be represented; 2. the server must reject all other options that are negotiated only in SYN and SYN+ACK, such as WScale and SACK, because it has nowhere to store them; 3. it adds cryptographic work. So when the SYN queue is full because of legitimate high concurrency, do not rely on this option; it is a stripped-down version of TCP.
  • tcp_synack_retries, use it to reduce the number of retries for SYN-ACK timeout, which also reduces the cleaning time of the SYN queue.
  • tcp_max_syn_backlog, increases the maximum number of SYN connections, that is, increases the SYN queue.
  • tcp_abort_on_overflow, reject the connection when the SYN queue is full.

TCP four-way "wave" (connection teardown)

Assuming that the client disconnects first, the seq in the example follows the last handshake.

Before closing, the tcp status of both ends is ESTABLISHED.

  1. The client sends a FIN packet (flags: FIN) to indicate that it can be closed, seq = x + 101, ack = y + 1. The client changes to FIN-WAIT-1 state.
  2. The server receives this FIN and returns an ACK, seq = y + 1, ack = x + 102. The server changes to CLOSE-WAIT state. After receiving this ACK, the client changes to the FIN-WAIT-2 state.
  3. The server may have some unfinished work, and after completion, it will send a FIN packet to decide to close, seq = y + 1, ack = x + 102. The server changes to LAST-ACK state.
  4. The client returns a confirmation ACK after receiving the FIN, seq = x + 102, ack = y + 2. The client changes to TIME-WAIT state
  5. After receiving the ACK from the client, the server directly closes the connection and changes to the CLOSED state. If the client does not receive the FIN from the server again after waiting for 2*MSL time, it closes the connection and changes to the CLOSED state.

Why is a long TIME-WAIT needed? 1. It can avoid the new connection that reuses the four-tuple from receiving delayed old packets. 2. It can ensure that the server has been closed.

Why is the TIME-WAIT time 2*MSL (maximum segment survival time, RFC793 defines MSL as 2 minutes, and Linux sets it to 30s)? Because after sending the FIN, the server will resend if the wait for ACK times out. The FIN has the longest survival MSL time, and the retransmission must occur before this. The resent FIN also has the longest survival MSL time. Therefore, after 2 times the MSL time, the client still has not received the resend from the server, indicating that the server has received the ACK and closed, so the client can be closed.

What should I do if there are too many TIME-WAITs generated by disconnection?

We know that Linux by default waits about 1 minute before releasing the connection, during which the port stays occupied. With a large number of concurrent short connections, too many TIME-WAIT sockets can exhaust the available ports or consume too much CPU.

Of the configurations listed below, the last two are strongly recommended against.

  • tcp_max_tw_buckets, controls the number of concurrent TIME-WAITs. The default value is 180000. If it exceeds, the system will destroy and record the log.
  • ip_local_port_range, increase the client port range
  • If possible, increase the service port of the server (tcp connections are based on ip and port, the more they are, the more connections are available)
  • If possible, increase the client or server IP
  • tcp_tw_reuse: timestamps must be enabled on both the client and the server before it can be used, and it only takes effect on the client (the connecting side). Once enabled, a socket in TIME-WAIT can be reused by a new connection after only 1 second instead of waiting out the full TIME-WAIT. Why must timestamps be enabled? Because a packet from the old connection may wander through the network and finally reach the server, and the new connection reusing the socket has the same tuple as that old packet; as long as the old packet's timestamp is earlier than the new connection's packets, it must belong to the old connection and can be rejected, preventing stale packets from being accepted by mistake.
  • tcp_tw_recycle: more aggressive; it quickly recycles sockets in the TIME_WAIT state. Fast recycling only happens when both tcp_timestamps and tcp_tw_recycle are enabled. When clients access the server through NAT and the server (which actively closed and therefore holds the TIME_WAIT state) has both tcp_timestamps and tcp_tw_recycle turned on, TCP segments arriving from the same source IP within 60 seconds must have increasing timestamps or they are discarded, which breaks clients behind NAT. Linux removed the tcp_tw_recycle configuration starting with kernel 4.12.

tcp sliding window and flow control

The operating system has opened a cache area for tcp, which limits the maximum number of data packets sent and received by tcp. It can be visualized as a sliding window. The sender's window is called the sending window swnd, and the receiver's is called the receiving window rwnd. The length of the data that has been sent but not received ack + the length of the buffered data to be sent = the total length of the sending window. 

Send window

 

During the handshake the two ends exchange window values and the smaller one is ultimately used. Suppose the sender's window size is 20 and 10 packets have been sent without an ack yet; then only 10 more packets can be placed in the buffer, and once the buffer is full no more data can be sent. The receiver also puts incoming data into its buffer; if its processing speed is lower than the peer's sending speed, the buffer piles up, the available receive window shrinks, and the window value carried in the ack tells the sender to send less. In addition, the operating system may resize the buffer. A problem can then arise: the available receive window was 10 and had already been advertised to the peer via an ack, but the operating system suddenly shrinks the buffer, so the real available window becomes smaller than what was advertised (it may even go negative). The sender, still believing the window is 10, keeps sending data that the receiver cannot hold, and the transfer times out. To avoid this, TCP requires that if the operating system wants to shrink the buffer, the reduced available window must be advertised to the peer in advance.

From the above we know that TCP limits the sending rate through the windows on both ends; a window of 0 means sending should pause. When the receiver's buffer is full it sends an ack with window 0, and later, once it can receive again, it sends an ack with a non-zero window to tell the sender to resume. If that ack is lost, the consequence is serious: the sender never learns that the receiver can accept data again and keeps waiting, a deadlock. To avoid this, TCP is designed so that after the sender is told to stop (i.e. after it receives the window-0 ack), it starts a timer and sends a window probe every 30-60 seconds; on receiving it, the receiver must reply with its current window. If the probed window is 0 three times in a row, some TCP implementations send an RST to break the connection.

If the receiver's window is already very small, the sender will still use it to send data; the TCP header plus IP header is 40 bytes while the data might be only a few bytes, which is very uneconomical. How do we avoid this? Let's look at how small data packets are optimized.

tcp small packet

For the receiver, the strategy is simply never to invite sending into a tiny window: if the receive window falls below min(MSS, buffer size / 2), it advertises a window of 0 to the peer to stop data from being sent, until the window grows back above that threshold.

For the sender, the Nagle algorithm is used: data is sent only when at least one of the following two conditions is met:

  • Window size >= MSS and total data size >= MSS
  • Receive the ack of previously sent data

If none of them are met, it will keep accumulating data and then send it all together when a certain condition is met.

The pseudo code is as follows

if there is new data to send then
    if the window size ≥ MSS and available data is ≥ MSS then
        send complete MSS segment now
    else
        if there is unconfirmed data still in the pipe then
            enqueue data in the buffer until an acknowledge is received
        else
            send data immediately
        end if
    end if
end if

The Nagle algorithm is on by default, but in scenarios such as SSH, with small payloads and many interactions, Nagle interacts very badly with delayed ack, so it needs to be turned off. (The Nagle algorithm has no global system-wide switch; it must be disabled per application.)
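For example, in Node.js an application can disable Nagle on its own sockets with net.Socket's setNoDelay method; a minimal sketch, where the host and port are placeholders for an interactive service such as SSH:

import net from "node:net";

const socket = net.connect({ host: "example.com", port: 22 }, () => {
  // Disable the Nagle algorithm for this socket so small writes
  // (e.g. interactive keystrokes) are sent immediately
  socket.setNoDelay(true);
  socket.write("small interactive payload");
});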

Having covered small-packet optimization, let's return to the sliding window. The window TCP finally uses is not determined by the sliding window alone: the sliding window only keeps both ends within their send/receive capacity, but the network between them must also be considered. Even if both ends are very capable, if the network is currently poor, blasting out large amounts of data only makes congestion worse, so there is also a congestion window, and TCP takes the minimum of the sliding window and the congestion window.

tcp slow start and congestion avoidance

First of all, what is the MSS? It is the maximum number of data bytes allowed in one TCP segment, computed as the MTU (the maximum data-link-layer payload, determined by the hardware) minus the 20-byte IP header and the 20-byte TCP header, usually 1460. That means one TCP packet can carry at most 1460 bytes of upper-layer data. The MSS is negotiated as the minimum of both ends during the TCP handshake. In a real network a request passes through many intermediate devices, which may rewrite the MSS in the SYN, so the final value is the minimum along the entire path, not just the minimum of the two ends.

TCP has a cwnd (congestion window) responsible for avoiding network congestion. Its value is an integer multiple of the TCP segment size and represents how many packets TCP may send at once (for convenience we count from 1). Its initial value is very small and it grows gradually, probing the available network capacity until packet loss and retransmission occur. In the classic slow-start algorithm, in fast-acknowledgment mode, cwnd increases by 1 for every ack received, so cwnd grows exponentially: 1, 2, 4, 8, 16... until it reaches the slow-start threshold ssthresh. ssthresh is generally max(amount of data in flight / 2, 2 * SMSS), where SMSS is the sender's maximum segment size. While cwnd < ssthresh, slow start is used; once cwnd >= ssthresh, the congestion-avoidance algorithm is used.

In the congestion-avoidance algorithm, cwnd increases by 1/cwnd for each ack received, i.e. cwnd + 1 once all the packets sent in the last round are acknowledged. Unlike slow start, congestion avoidance grows linearly, and it keeps growing until one of two kinds of retransmission occurs and cwnd is reduced: 1. a timeout retransmission, or 2. a fast retransmission.
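Purely as an illustration of the growth pattern described above (a toy model, not any real kernel's implementation), a short simulation:

// Toy model of cwnd growth per round trip: exponential during slow start,
// then linear during congestion avoidance; the numbers are illustrative only
function simulateCwnd(rounds: number, ssthresh = 16): number[] {
  const history: number[] = [];
  let cwnd = 1;
  for (let i = 0; i < rounds; i++) {
    history.push(cwnd);
    cwnd = cwnd < ssthresh ? cwnd * 2 : cwnd + 1; // slow start vs congestion avoidance
  }
  return history;
}

console.log(simulateCwnd(10)); // [1, 2, 4, 8, 16, 17, 18, 19, 20, 21]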

Fast/delayed ack, timeout retransmission and fast retransmission

In fast-acknowledgment mode the receiver sends an ack immediately after receiving a packet, but TCP does not necessarily ack every single packet, since that would waste bandwidth; it may also enter delayed-acknowledgment mode, where the receiver starts a delayed-ack timer and checks roughly every 200ms whether an ack needs to be sent, and if it has data of its own to send, the ack can be piggybacked on it. If the sender transmits 10 packets at once, the peer may not reply with 10 acks but only with the ack of the highest consecutively received packet. For example, if 1, 2, 3, ..., 10 are transmitted and the receiver gets them all, it replies with an ack of 10, so the sender knows the first 10 all arrived and continues from 11. If a packet in the middle is lost, the ack covers only the data received before the loss.

Timeout retransmission: the sender starts a timer after sending; the timeout (RTO) should be set slightly larger than one RTT (packet round-trip time). If the ack does not arrive in time, the packet is resent; if the resent data times out again, the timeout doubles. At that point ssthresh becomes cwnd/2, cwnd is reset to its initial value, and slow start is used again. You can see that cwnd falls off a cliff, so timeout retransmission hurts network performance badly. Do we always have to wait a full RTO before retransmitting?

Fast retransmission: TCP has a fast-retransmission design. If the receiver gets packets out of order, it keeps replying with the ack of the highest consecutive packet. If the sender receives 3 such duplicate acks in a row, it considers that packet lost and retransmits it immediately without falling back to slow start. For example, the receiver got 1, 2, and 4, so it acks 2; it then gets 5 and 6, but since 3 is still missing it acks 2 twice more. Having received the same ack three times in a row, the sender knows 3 was lost and quickly retransmits it. Once the receiver gets 3 the data is continuous, so it acks 6 and the sender continues from 7. As in the picture below:

 

When a fast retransmission occurs:

  1. ssthresh = cwnd/2, cwnd = ssthresh + 3, start retransmitting lost packets, and enter the fast recovery algorithm. The reason for +3 is that 3 duplicate acks were received, indicating that the current network can at least normally send and receive these 3 additional packets.
  2. When a duplicate ACK is received, the congestion window is increased by 1
  3. When the ACK of the new data packet is received, cwnd is set to the value of ssthresh in the first step.

The fast retransmission algorithm first appeared in the Tahoe release of 4.3BSD, and fast recovery first appeared in the Reno release; together they are known as the Reno version of the TCP congestion control algorithm. Reno's fast retransmission targets the retransmission of a single packet, but in practice one retransmission timeout may involve many packets, so problems arise when multiple packets are lost from one window and the fast retransmission and fast recovery algorithms are triggered. Hence NewReno appeared: a slight modification of Reno's fast recovery that can recover from multiple losses within one window. Concretely, Reno exits fast recovery as soon as it receives an ack for new data, whereas NewReno waits until every packet in the window has been acknowledged before exiting fast recovery, further improving throughput.

How TCP retransmits "accurately"

When partial packet loss occurs, the sender does not know exactly which packets were lost. For example, if the receiver got 1, 2, 4, 5, 6, the sender can tell from the acks that something after 3 is missing and triggers fast retransmission, but then faces two choices: 1. retransmit only packet 3; 2. not knowing whether 4, 5, 6... were also lost, simply retransmit everything from 3 onward. Neither is great: if you retransmit only 3 and later packets really were lost, each one must wait for its own retransmission; if you retransmit everything and only 3 was lost, the rest is wasted. How can this be optimized?

Fast retransmission only reduces the chance of triggering a timeout retransmission; neither fast nor timeout retransmission solves the problem of knowing exactly whether to retransmit one packet or all of them. A better method is Selective Acknowledgment (SACK), which both ends must support; Linux toggles it with the net.ipv4.tcp_sack parameter. SACK adds a block of data to the TCP header telling the sender which data segments beyond the highest consecutive one have already been received, so the sender knows those do not need retransmitting. A picture is worth a thousand words:

There is also Duplicate SACK (D-SACK). If the receiver's ack is lost, the sender mistakenly thinks the data was not received, times out and retransmits, so the receiver gets duplicate data. Or, because the original packet hit network congestion, the retransmitted packet arrives before it and the receiver again ends up with duplicates. In these cases the receiver can put a SACK block in the TCP header whose value is the duplicated segment range; because that range lies below the ack, the sender knows the receiver already had the data and will not retransmit it.

 

 D-SACK is switched on and off in Linux through the net.ipv4.tcp_dsack parameter.

To sum up, SACK and D-SACK let the sender know which packets were not received and whether packets were received in duplicate, so it can tell whether a data packet was lost, an ack was lost, a packet was delayed by the network, or the network duplicated the packet.

A more powerful cache: Service Worker

The HTTP cache control discussed above is driven mainly by the backend, and once the cache expires, even with the negotiation cache there are still requests that need the network; moreover it generally only caches GET requests. These limitations keep the front end from behaving like a local client application. So is there a way for the front end to fully proxy the cache itself? Whether static resources or API responses, everything decided by the front end, even turning the web page into a complete local application like an App? That is the Service Worker we will talk about next; let's see what it can do.

Offline caching

Service Worker can be regarded as a proxy between the application and the network request. It can intercept the request and take appropriate actions based on whether the network is available or other custom logic. For example, you can cache HTML, CSS, JS, pictures and other resources after the application is opened for the first time. The next time you open the web page, intercept the request and return it directly to the cache, so that your application can be opened offline. If the device is connected to the Internet later, you can request the latest resources in the background and determine whether it has been updated. If it has been updated, you can remind the user to refresh and upgrade. In terms of startup, front-end applications using Service Worker do not require a network at all, just like client Apps.

push notification

In addition to proxying requests, Service Worker can also actively let the browser send notifications, just like App notifications. You can use this function to do "user recall", "hot notifications", etc.

Restrictions

Our main JS code runs in the rendering thread, while the Service Worker runs in a separate worker thread, so it does not block the main thread, but it also means some APIs are unavailable, such as DOM manipulation. It is designed to be fully asynchronous, so synchronous APIs such as XHR and Web Storage cannot be used, while fetch is available. Dynamic import() is also not allowed; only static module imports are.

For security reasons, Service Workers only run over HTTPS (localhost is allowed over http): the ability to take over requests is so powerful that if it were maliciously tampered with by a middleman, an ordinary user's page might never render the correct content again. In Firefox it is also unavailable in private browsing mode.

Usage

The Service Worker code should be an independent JS file accessible via an HTTPS request (in a development environment, addresses such as http://localhost are allowed). With that ready, first register it in your project code:

if ("serviceWorker" in navigator) {
  navigator.serviceWorker.register("/js/service-worker.js", {
    scope: "../",
  });
} else {
  console.log("This browser does not support Service Worker");
}

Assume that your website address is https://www.xxx.com and the Service Worker script lives at https://www.xxx.com/js/service-worker.js; the /js/service-worker.js path in the example actually requests https://www.xxx.com/js/service-worker.js. The scope option says under which paths the Service Worker takes effect. If scope is not set, it defaults to the directory the Service Worker script lives in (here /js/). With the relative notation in the example, ./ means the effective path is /js/*, and ../ means the site root; note that widening the scope above the script's own directory normally also requires the server to send the Service-Worker-Allowed response header.

Service Worker will go through these 3 life cycles

  1. Download
  2. Install
  3. Activate

The first is the Download stage. When you enter a page controlled by a Service Worker, the script starts downloading immediately. If it was downloaded before, this download may lead to an update check. An update check happens in the following circumstances:

  1. A page jump within the scope occurred
  2. A functional event (such as push or sync) was triggered in the Service Worker and it has not been downloaded within the past 24 hours.

When the downloaded file is found to be new, the browser tries to Install it. The criteria for judging it as new are: it is the first download, or it differs from the existing file in a byte-by-byte comparison.

If this is the first time this Service Worker is seen, installation is attempted, and after a successful install it is activated.

If an old Service Worker is already in use, it will be installed in the background and will not be activated after installation. This situation is called worker in waiting. Just imagine that the old and new js may have logical conflicts. The old js has been running for a while. If you directly replace the old one with the new one and continue to run the web page, it may directly crash.

When will the new Service Worker be activated? You must wait until all pages using the old Service Worker are closed before the new Service Worker becomes an active worker. You can also use ServiceWorkerGlobalScope.skipWaiting() to skip waiting directly. Clients.claim() allows the new Service Worker to control currently existing pages (those using the old Service Worker).
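
To make this concrete, here is a minimal sketch of install and activate handlers; the cache name and the pre-cached URL list are made up for illustration, and whether you call skipWaiting()/clients.claim() depends on whether the new worker may safely take over pages opened with the old one:

const PRECACHE = "precache-v1"; // hypothetical cache name

self.addEventListener("install", (event) => {
  event.waitUntil(
    caches
      .open(PRECACHE)
      // Pre-cache a few core assets; this list is just an example
      .then((cache) => cache.addAll(["/", "/index.css", "/index.js"]))
      // Optional: skip the "worker in waiting" phase
      .then(() => self.skipWaiting())
  );
});

self.addEventListener("activate", (event) => {
  event.waitUntil(
    Promise.all([
      // Optional: take control of pages still controlled by the old worker
      self.clients.claim(),
      // Clean up caches left behind by previous versions
      caches
        .keys()
        .then((keys) =>
          Promise.all(keys.filter((k) => k !== PRECACHE).map((k) => caches.delete(k)))
        ),
    ])
  );
});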

As shown above, you can tell when install or activate happens by listening for those events. The most commonly used event is FetchEvent, which fires whenever the page initiates a request. You can also use the Cache API to store data and use FetchEvent.respondWith() to return whatever response you want. The following is a common way to write a cache-first fetch handler:

// Cache version; bump it to invalidate old caches
const VERSION = 1;

const shouldCache = (url: string, method: string) => {
  // Customize shouldCache to control which requests should be cached
  return true;
};

// Listen to every request
self.addEventListener("fetch", async (event) => {
  const { url, method } = event.request;
  event.respondWith(
    shouldCache(url, method)
      ? caches
          // Look up the cache
          .match(event.request)
          .then(async (cacheRes) => {
            if (cacheRes) {
              return cacheRes;
            }
            const awaitFetch = fetch(event.request);
            const awaitCaches = caches.open(VERSION);
            const response = await awaitFetch;
            const cache = await awaitCaches;
            // Put the response into the cache
            cache.put(event.request, response.clone());
            return response;
          })
          .catch(() => {
            return fetch(event.request);
          })
      : fetch(event.request)
  );
});

The cache in the code above is never updated once it is established. If your content may change and you are worried about the cache going stale, you can return the cached response first so users see content as fast as possible, then request the latest data in the Service Worker in the background and update the cache, and finally notify the main thread that the content has changed so the user can decide whether to upgrade the application. You can try writing the background request and update-detection code yourself. Here we mainly discuss how the Service Worker tells the main thread that the requested content has been updated: how do the two threads communicate?
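
As a rough illustration of the "return the cache first, refresh in the background" idea (a stale-while-revalidate style sketch; VERSION is the cache name from the example above, and the content comparison and main-thread notification are only indicated by comments):

self.addEventListener("fetch", (event) => {
  event.respondWith(
    caches.open(VERSION).then(async (cache) => {
      const cached = await cache.match(event.request);
      // Always fire a background request to refresh the cache
      const networkPromise = fetch(event.request).then((response) => {
        cache.put(event.request, response.clone());
        // Here you could compare the old and new responses and, if they differ,
        // post a "content-update" message to the page (see the next section)
        return response;
      });
      // Return the cached response immediately if there is one, otherwise wait for the network
      return cached || networkPromise;
    })
  );
});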

How Service Worker communicates with the main thread

Why is communication needed? First, for debugging: console.log from the worker thread does not show up alongside your page's logs. Second, if your Service Worker resources have been updated, you probably want to notify the main thread so the page can show a message asking the user whether to update. So communication can be a business necessity. Since the Service Worker is a separate thread, it cannot share variables with our main thread directly, but once the communication problem is solved it enables many useful patterns; for example, multiple pages of the same site can use the Service Worker thread to communicate across pages. So how do we communicate? We can create a message channel with new MessageChannel(), which has two ports that can independently send and receive messages. Hand one port, port2, to the Service Worker and keep port1 on the main thread, and the two can then talk through this channel. The following code shows how to let the two threads talk to each other to implement features such as "printing worker-thread logs", "notifying about content updates", and "upgrading the application".

code in main thread 

const messageChannel = new MessageChannel();

// Hand port2 over to the Service Worker that controls the current page
navigator.serviceWorker.controller.postMessage(
  // "messageChannelConnection" is a custom value used to distinguish message types
  { type: "messageChannelConnection" },
  [messageChannel.port2]
);

messageChannel.port1.onmessage = (message) => {
  // You can define your own message format for different business needs
  if (typeof message.data === "string") {
    // Print logs coming from the worker thread
    console.log("from service worker message:", message.data);
  } else if (message.data && typeof message.data === "object") {
    switch (message.data.classification) {
      case "content-update":
        // Define different message types for different UI behavior, e.g. notifying the user of an update
        alert("New content is available; refresh the page to see it.");
        break;
      default:
        break;
    }
  }
};

 Code in Service Worker

let messageChannelPort: MessagePort;

// Message received
const onMessage = (event: ExtendableMessageEvent) => {
  if (event.data && event.data.type === "messageChannelConnection") {
    // Got port2; keep it for later
    messageChannelPort = event.ports[0];
  } else if (event.data && event.data.type === "skip-waiting") {
    // If the main thread sends a "skip-waiting" message, update the Service Worker immediately, which upgrades the application
    self.skipWaiting();
  }
};

// Register the listener after onMessage is defined (a const cannot be referenced before its declaration)
self.addEventListener("message", onMessage);

// Send a message
const postMessage = (message: any) => {
  if (messageChannelPort) {
    messageChannelPort.postMessage(message);
  }
};

File compression, image performance, device pixel adaptation

Compression of resource files such as js, css, pictures, etc. can greatly reduce the size and greatly improve network performance. Generally, the backend service will automatically configure the compression header for us, but we can also switch to a more efficient compression algorithm to get a better compression ratio.

content-encoding

If you open any website and look at its network panel, you will see a content-encoding header in the response headers, whose value can be gzip, compress, deflate, identity, br, and so on. Except for identity, which means no compression, you can use the other values to compress files and speed up HTTP transfer; the most common is gzip. Where compatibility allows, you can specifically enable newer compression formats such as br (Brotli) to achieve compression rates beyond gzip.
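
Compression is normally configured in the web server, gateway, or CDN (nginx's gzip/brotli modules, for example); as a rough sketch of what happens, here is a hypothetical Node.js handler that picks br or gzip based on the Accept-Encoding request header:

const http = require("http");
const zlib = require("zlib");

http
  .createServer((req, res) => {
    const body = "<html>...a large response body...</html>";
    const accepted = req.headers["accept-encoding"] || "";

    if (accepted.includes("br")) {
      // Brotli usually compresses text resources better than gzip
      res.writeHead(200, { "Content-Encoding": "br" });
      res.end(zlib.brotliCompressSync(body));
    } else if (accepted.includes("gzip")) {
      res.writeHead(200, { "Content-Encoding": "gzip" });
      res.end(zlib.gzipSync(body));
    } else {
      // identity: no compression
      res.end(body);
    }
  })
  .listen(8080);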

font file

If the page requires a special font and the text on the page is fixed or small (for example, only letters and numbers), you can manually subset the font file so that it contains only the necessary glyphs, which can greatly reduce the file size.

If the text on the page is dynamic, you cannot know in advance which characters it will contain. In suitable scenarios, such as letting users preview a font while typing, users generally only enter a few characters, so there is no need to ship the whole font package, yet you do not know what they will type. You can therefore let the backend (or a bff layer built on nodejs) dynamically generate a font file containing only the required characters and return it based on the text you ask for. Although this adds one extra request, font files of several MB or even more than ten MB can be reduced to a few KB.

Image Format

Images are generally not compressed through the methods above, because image formats are already compressed and compressing them again has little effect. Therefore the choice of image format is the key factor affecting image size and quality. Generally speaking, higher compression means longer encoding time and worse image quality, but this is not absolute: a new format may do everything better than an old one at the cost of poorer compatibility. So you need to find a balance.

In terms of image formats, in addition to the common PNG-8/PNG-24, JPEG, and GIF, we pay more attention to several other newer image formats:

  • WebP
  • JPEG XL
  • AVIF

Use a table to compare them in terms of image type, transparency channel, animation, encoding and decoding performance, compression algorithm, color support, memory usage, and compatibility:

 

 From a technical development perspective, priority is given to using relatively new image formats: WebP, JPEG XL, and AVIF. JPEG XL is very promising to replace the traditional image format, but the compatibility is still very poor. AVIF compatibility is better than JPEG XL, retaining high picture quality after compression and avoiding annoying compression artifacts and other problems. However, the decoding and encoding speeds are not as fast as JPEG XL, and progressive rendering is not supported. WebP is supported by basically all browsers except IE. For complex images (such as photos), WebP lossless encoding performance is not good, but lossy encoding performance is very good. The image decoding speed of WebP with similar quality is not much different from that of JPEG XL, but the file compression ratio can be improved a lot. So for now, it seems that if you want to improve the image performance of your website, it would be better to use WebP instead of the traditional format.        

Use of Picture element

So is there anything that can automatically help us use image formats similar to the WebP, AVIF and JPEG XL we mentioned above on browsers that support some modern image formats, while browsers that do not support fall back to regular JPEG, PNG method? The HTML5 specification adds a new Picture Element. The <picture> element provides versions of an image for different display/device scenarios by containing zero or more <source> elements and an <img> element. The browser will select the best matching child <source> element, or if there is no match, select the URL in the src attribute of the <img> element. The selected image is then rendered in the space occupied by the <img> element. 

<picture>
  <!-- Modern image formats that perform better but need compatibility consideration -->
  <source srcset="image.avif" type="image/avif" />
  <source srcset="image.jxl" type="image/jxl" />
  <source srcset="image.webp" type="image/webp" />

  <!-- Final fallback -->
  <img src="image.jpg" />
</picture>

Image size adaptation: physical pixels, device-independent pixels

If you want great image performance, you must use appropriate image sizes for elements of different sizes. If a 500*500 image is displayed in a 100*100 pixel area, this is obviously a waste; on the contrary, a 100*100 image at 500*500 pixels is very blurry, which reduces the user experience. Before talking about size adaptation, we must first talk about what device independent pixels and physical pixels are, and what DPR is.

When we write width: 100px in CSS, what is shown on the screen is actually 100 device-independent pixels (also called logical pixels), which is not necessarily 100 physical pixels on the screen. On early displays, device-independent pixels and physical pixels were 1:1, i.e. width: 1px corresponded to one physical light-emitting point. As display technology improved, screens of the same size packed in more and more pixels: what used to be one pixel's area may now hold 4 pixels. This brings higher pixel density and a better visual experience, but it also creates a problem: if width: 1px still meant one physical pixel, the same page would shrink on such a device because each pixel is now smaller. To solve this, manufacturers introduced device-independent pixels, which are logical rather than real pixels. If the area of one old pixel is now covered by 2 smaller pixels, the device's device pixel ratio (DPR) is 2, and something drawn with width: 1px is drawn with 2 physical pixels, so its size stays consistent with before. Likewise, on an even finer screen where 3 smaller pixels replace the traditional 1-pixel size, the DPR is 3 and width: 1px is actually drawn with 3 physical pixels. Now you can understand why interviewers ask questions like "how do you draw a 1px border": under a high DPR your 1px is not really 1 physical pixel.

So we get this pixel relationship: physical pixels = CSS pixels (device-independent pixels) × DPR.

Provide appropriate pictures for different DPR screens

Therefore, although our img elements are all 100px, the optimal image size we need to display is actually different on different DPR devices. When DPR = 2, a 200px image should be displayed, and when DPR = 3, a 300px image should be displayed, otherwise blurry conditions will occur.

So, what are some possible solutions?

Option 1: Simple and crude: always use the highest-multiplier images

The highest DPR among common devices today is 3, so the simplest approach is to serve the highest-resolution 3x images everywhere by default. But this wastes a lot of bandwidth, drags down network performance, and degrades the user experience, which certainly does not fit the "tone" of this article.

Option 2: Media queries

We can use @media media queries to apply different CSS according to the current device's DPR (the image file names below are illustrative):

#img {
  background: url(img@1x.png);
}
@media (device-pixel-ratio: 2) {
  #img {
    background: url(img@2x.png);
  }
}
@media (device-pixel-ratio: 3) {
  #img {
    background: url(img@3x.png);
  }
}

The advantage of this option is that it can show images at different multipliers under different DPRs.

The disadvantages are:

  • There are many logic branches, and devices are not limited to DPR = 2 or 3; some even have fractional DPRs, so covering everything requires a lot of code.
  • Syntax compatibility issues: in some browsers the feature is -webkit-min-device-pixel-ratio. You can solve this with autoprefixer, but that adds extra cost.

Option 3: The CSS image-set syntax

#img {
  /* Browsers that do not support image-set */
  background-image: url("../img@1x.png");

  /* Browsers that support image-set */
  background-image: image-set(
    url("./img@2x.png") 2x,
    url("./img@3x.png") 3x
  );
}

The 2x and 3x match different DPRs. The drawbacks of image-set are the same as media queries, so there is no need to repeat them. Its advantage over media queries is mainly that it is more niche, so you can show off a little.

Option 4: The srcset attribute

<img src="img@1x.png" srcset="img@2x.png 2x, img@3x.png 3x" />

The 2x and 3x match different DPRs, and img@1x.png is the fallback. The pros and cons are the same as image-set, with perhaps one extra advantage: no CSS is needed, which is more concise.

Option 5: The srcset attribute combined with the sizes attribute

<img
  sizes="(min-width: 600px) 600px, 300px"
  src="img-300.png"
  srcset="img-300.png 300w, img-600.png 600w, img-900.png 900w"
/>

sizes="(min-width: 600px) 600px, 300px" means: if the screen's current CSS pixel width is greater than or equal to 600px, the CSS width of the image is 600px; otherwise it is 300px. Because your layout may be flexible, the img element's size may differ across screen sizes; the other options above can only choose by DPR and cannot handle this. Note that sizes also requires a matching @media rule that actually changes the img's width at the same breakpoint.

In srcset="img-300.png 300w, img-600.png 600w, img-900.png 900w", the 300w, 600w, and 900w are width descriptors. If you are on a DPR = 2 device and the img element's CSS width is 300 according to sizes, the actual physical pixels are 600, so the 600w image is used.

The disadvantage of this option is still the same as before: you need to prepare different images for different DPRs. But it has a unique advantage: in a responsive layout it can flexibly pick the actual image resolution according to the img element's size. That is why I recommend option five.

Lazy loading and asynchronous decoding of images

Lazy loading of images means that when the page has not scrolled to the target area, the images there are not requested and displayed, so as to speed up the display of content in the visible area. The current front-end specifications are very rich. We have js, html and other methods to implement lazy loading of images. 

Option 1: Use onscroll in js

This is a simple and crude solution: get the distance of every image on the page from the top of the viewport with the getBoundingClientRect API, listen to page scrolling with the onscroll event, calculate which images have entered the visible area based on the viewport height, and then set the src attribute of those img elements to trigger loading.
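
A minimal sketch of this approach, assuming images are written as <img data-src="..."> so the real request only starts when src is assigned:

const imgs = document.querySelectorAll("img[data-src]");

function loadVisibleImages() {
  const viewportHeight = window.innerHeight;
  imgs.forEach((img) => {
    if (img.src) return; // already loaded
    const { top } = img.getBoundingClientRect();
    if (top < viewportHeight) {
      // The image has entered the viewport: start the real request
      img.src = img.dataset.src;
    }
  });
}

window.addEventListener("scroll", loadVisibleImages);
loadVisibleImages(); // also check once on first render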

The advantage of this solution is that the logic is simple and easy to understand, no new APIs are used, and the compatibility is good.

The disadvantages of this solution are:

  1. Need to introduce js, which brings some code amount and calculation cost
  2. Need to obtain the position information of all image elements, which may trigger additional reflow
  3. Need to monitor scroll at all times and trigger callbacks frequently
  4. If a scrolling list is nested in the page, this solution cannot know the visibility of elements in the nested scrolling list, and requires more complex writing.

Option 2: Use IntersectionObserver in js

The HTML5 IntersectionObserver API (the "intersection observer"), together with the isIntersecting property of the observed entries, tells you whether an element is within the visible area, which makes it possible to implement image lazy loading with better performance than listening to onscroll. The observed element triggers a callback when it appears in or disappears from the visible area, and you can also control the visibility-ratio threshold. See the MDN documentation for details.
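
A minimal sketch, again assuming <img data-src="..."> markup:

const observer = new IntersectionObserver((entries) => {
  entries.forEach((entry) => {
    if (entry.isIntersecting) {
      const img = entry.target;
      // The image is visible: start the real request and stop observing it
      img.src = img.dataset.src;
      observer.unobserve(img);
    }
  });
});

document.querySelectorAll("img[data-src]").forEach((img) => observer.observe(img));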

The advantages of this solution are:

  1. The performance is much better than onscroll. It does not need to monitor scrolling at all times, nor does it need to obtain the element position. The visibility is known by the render thread when drawing, and there is no need to judge it through js. This way of writing is more natural.
  2. It can really know the visibility of elements. For example, if an element is blocked by a higher-level element, it is invisible, even if it already appears in the visible area. This is something that the onscroll solution cannot do.

The disadvantages of this solution are:

  1. Need to introduce js, which brings some code amount and calculation cost
  2. Older devices are not compatible and need to use polyfill

Option 3: css style content-visibility

If an element with the content-visibility: auto style is not currently on the screen, the element will not be rendered. This method can reduce the drawing and rendering work of elements in non-visible areas, but image resources are requested when HTML is parsed, so this CSS solution cannot truly implement lazy loading of images.

Solution 4: HTML attribute loading=lazy

<img src="xxx.png" loading="lazy" />

Asynchronous image decoding

As we all know, formats such as JPEG and PNG are encoded; for the GPU to use an image it must first be decoded. If a particular image format decodes slowly, it can hold up the rendering of other content. Therefore HTML5 added a decoding attribute to tell the browser how it should decode the image data.

Its optional values ​​are as follows:

  • sync: Decode the image synchronously to ensure it is displayed together with other content.
  • async: Decode images asynchronously to speed up display of other content.
  • auto: Default mode, indicating no preference; the browser decides which way is best for the user.

<img src="xxx.jpeg" decoding="async" />

This allows the browser to decode the image asynchronously, speeding up display of other content. This is an optional part of your image optimization plan.

Image performance optimization summary

In general, for image performance optimization, you need to:

  1. Choose an image format with high compression rate, fast decoding speed, and good image quality.
  2. Adapt appropriate image resolution according to actual DPR and element size
  3. Use a better performance solution for lazy loading of images, and use asynchronous decoding depending on the situation.

Build tool optimization

There are many popular front-end build/bundling tools now: the older webpack and rollup, the recently popular vite and snowpack, and new forces such as esbuild, swc, and turbopack. Some are implemented in js, some are written in high-performance languages such as Go and Rust, and some use ESM features for on-demand bundling. But these mostly optimize speed during development or build and have little to do with client performance, so we will not go into them here. We will mainly talk about optimizing network performance through production builds. Although these tools have all kinds of configuration, the common optimization points are: code minification, code splitting, common-code extraction, CSS extraction, serving assets from a CDN, and so on; only the configuration differs, and the documentation covers it, often working right out of the box. Some people may not understand these terms very well, so here is an explanation.

There is not much to explain about code minification: it shortens variable names, removes newlines and whitespace, and so on, to make the code smaller.

The purpose of code splitting: for example, in a SPA, page A is reached from the homepage via client-side routing, so there is no need to bundle page A's components together with the main application loaded on the homepage. Users may never navigate there, and bundling it in only increases the homepage bundle size and slows the first screen. In some build tools you can use dynamic import (import('PageA.js')), and the build tool will package the page A code referenced on the homepage into a separate chunk, say a.js. When the user clicks through to page A from the homepage, a.js with the component code is requested automatically, and then the route switches and renders. Some frameworks do this out of the box without you writing dynamic imports: just define the routes and the code is split automatically, for example React's nextjs framework. This is only one use case of code splitting; in short, whenever you do not want a module to be bundled with the main application, you can split it out for better performance of the first batch of js.
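
In a React SPA, for example, route-level splitting is often written roughly like this (a sketch; PageA and the surrounding setup are placeholders):

import { lazy, Suspense } from "react";

// The build tool sees the dynamic import and emits PageA as a separate chunk
const PageA = lazy(() => import("./PageA"));

function App() {
  return (
    <Suspense fallback={<div>loading…</div>}>
      {/* PageA's chunk is only requested when this route actually renders */}
      <PageA />
    </Suspense>
  );
}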

The purpose of common-code extraction: suppose you are writing a SPA, you use the ramda library in pages A, B, and C, and the code is split on these three pages so they are three independent chunks: a.js, b.js, c.js. By normal logic, the ramda library, as their dependency, would be included in all three chunks, meaning the three pages duplicate the ramda code for no reason. That is obviously not good. The better way is to put the ramda library into a separate shared chunk in the main application so it only needs to be requested once, and A, B, and C can all use it. This is what common-code extraction does. For example, in webpack you can define how many times a module must be depended on before it is extracted into a separate common chunk.

optimization: {
  // SplitChunksPlugin is built into webpack; it extracts code shared by multiple entries into a separate chunk
  splitChunks: {
    chunks: 'all',
    // at least 30kb
    minSize: 30000,
    // depended on at least 6 times
    minChunks: 6,
  },
}

However, starting from webpack 4 it can optimize this automatically via mode, so you usually do not need to worry about it; read the documentation of your build tool to avoid unnecessary optimization.

The purpose of CSS extraction: for example, if you only use css-loader + style-loader in webpack, your CSS is compiled into the js and the js inserts style tags for you when rendering. Your js silently gets bigger, and style rendering is postponed until the js executes. Since js is usually bundled at the end of the page, the page has no styles until that last js has been requested and executed. Ideally, CSS and the DOM should be parsed and rendered in parallel, which is why CSS should be extracted: it bundles the CSS into separate .css files referenced by link tags at the top of the HTML instead of being embedded in js.
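
In webpack this is typically done with mini-css-extract-plugin; a minimal sketch (file naming and other options depend on your setup):

// webpack.config.js
const MiniCssExtractPlugin = require("mini-css-extract-plugin");

module.exports = {
  module: {
    rules: [
      {
        test: /\.css$/i,
        // Replace style-loader with the plugin's loader so CSS is emitted as .css files
        use: [MiniCssExtractPlugin.loader, "css-loader"],
      },
    ],
  },
  plugins: [new MiniCssExtractPlugin({ filename: "[name].[contenthash].css" })],
};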

Tree Shaking optimization

We know that bundlers use ESM-based Tree Shaking during packaging to remove dead code for us.

For example, here is a bar.js:

// bar.js
export const fn1 = () => {};

export const fn2 = () => {};

Then index.js uses its fn1 function:

// index.js
import { fn1 } from "./bar.js";

fn1();

If we bundle with index.js as the entry, fn2 will be removed from the final output.

However, tree shaking fails in some scenarios. It requires your code to be free of "side effects", i.e. module initialization must not affect the outside world, similar to side effects in functional programming.

Look at the following example:

// bar.js
export const fn3 = () => {};
console.log(fn3);

export const fn4 = () => {};
window.fn4 = fn4;

export const fn5 = () => {};
// index.js
import { fn5 } from "./bar.js";

fn5();

Although fn3 and fn4 are never used, they will still end up in the bundle, because declaring them produced side effects: logging and modifying an external variable. If they were dropped, the result could differ from what you expect — for example you think window was modified when it actually was not; object properties can have setters, and there could be even more unexpected bugs.

In addition, the following ways of writing also break tree shaking:

// bar.js
const a = () => {};
const b = () => {};
export default { a, b };

// import o from './bar.js'
// o.a()

// bar.js
module.exports = {
  a: () => {},
  b: () => {},
};

// import o from './bar.js'
// o.a()

You must not wrap your exports in a single object: ESM Tree Shaking is static analysis and cannot know what happens at runtime. The same applies to CommonJS module syntax: although bundlers tolerate mixing the two, it easily makes Tree Shaking ineffective.

So to take full advantage of Tree Shaking, pay attention to how you write modules. Before going live, you can use a bundle analysis tool to see which packages have abnormal sizes.

Optimization of the front-end technology stack

In addition to affecting the runtime speed, the selection of the technology stack may also have an impact on the network speed.

Replace libraries with smaller ones. For example, if you use lodash, even if you only use one function from it, the whole library gets bundled, because it is commonjs-based and cannot be tree shaken at all. If you are sensitive to page weight, consider replacing it with another library, or importing only the submodules you use.
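
For instance, assuming a bundler with tree shaking, the import style alone makes a big difference (a sketch; actual savings depend on your setup):

const fn = () => {};

// Pulls in the whole commonjs lodash build — hard for the bundler to shake
import _ from "lodash";
_.debounce(fn, 200);

// Ships only the debounce module
import debounce from "lodash/debounce";
debounce(fn, 200);

// Or use the ESM build so unused exports can be tree shaken
import { debounce as esDebounce } from "lodash-es";
esDebounce(fn, 200);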

Code redundancy caused by the development approach. For example, with style solutions such as sass, less, native css, styled-components, or emotion, it is easy to write duplicated style code. If component A and component B both need width: 120px;, you will most likely write it twice; fine-grained reuse is hard (almost nobody extracts a single line of style for reuse; maybe only when 7 or 8 lines are identical do you think of reusing them). The bigger and older the project, the more duplicated style code, and your resource files keep growing. You can switch to tailwindcss, an atomic CSS library. If you need width: 120px;, in react you write <div className="w-[120px]"></div>, and every place needing that width is written the same way, so they all reuse the same class. Tailwind keeps your CSS small without any runtime overhead. Because the classes live with the components, you also benefit from ESM tree shaking: components that are no longer used are removed from the bundle together with their styles, whereas with sass, css and similar solutions it is hard to automatically remove styles that are no longer used from a css file. CSS-in-JS solutions such as styled-components and emotion can also be tree shaken, but they still have the duplication problem and runtime overhead. Tailwind has its own shortcomings too, such as not supporting lower versions of nodejs and a learning cost for the syntax.

runtime level

Runtime mainly refers to the process of executing JavaScript and page rendering, which involves technology stack optimization, multi-thread optimization, V8-level optimization, browser rendering optimization, etc. 

How to optimize rendering time

Rendering time is not only affected by the complexity of your DOM and style, it is also affected by many aspects.

There are many kinds of tasks in the rendering thread

Before this section we need to talk about the concept of tasks. Some people already have some understanding of macro tasks: the code in a script and various callbacks (events, setTimeout, ajax, etc.) are all macro tasks. But you may only know macro tasks in isolation and lack a broader view of tasks in general. Only by understanding them from a higher level can you really understand why js and rendering block each other, and why two adjacent macro tasks are not necessarily executed back to back.

When you open a page, the browser starts a renderer process with a rendering (main) thread in it. Most front-end work runs on this thread: DOM construction, CSS rendering, and js execution. Since there is only a single thread, a task queue is designed so that time-consuming work does not block it: operations such as requests and IO are handed to other threads, and when they finish, their callbacks are put into the queue; the rendering thread keeps polling the queue and executing the task at its head. Most js tasks can be understood as macro tasks, but it is not only js: page rendering is also a task for the rendering thread. In the Performance panel of DevTools you can see the task responsible for rendering (made up of sub-tasks such as Parse HTML, Layout, Paint, etc.). The execution of js, the so-called macro task, is actually the Evaluate Script task (which contains sub-tasks such as Compile Code and Cache Script Code, responsible for runtime compilation, code caching, and so on); initially it appears as a sub-task inside the Parse HTML task. There are also many built-in tasks, such as GC garbage collection. There is also a special kind of task called microtasks, shown as Run Microtasks in Performance. They are created inside macro tasks and placed into the microtask queue of the current macro task. When the macro task finishes and all execution stacks have exited, there is a checkpoint: if the microtask queue is not empty, all microtasks in it are executed. Microtasks can be created via Promise.then, queueMicrotask, MutationObserver callbacks, nextTick in node, and so on.
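
A tiny example of the resulting ordering (the output order follows from the queue model described above):

console.log("script start");          // part of the current macro task

setTimeout(() => console.log("timeout"), 0); // queued as a new macro task

Promise.resolve().then(() => console.log("microtask")); // queued as a microtask

console.log("script end");

// Expected order: "script start", "script end", "microtask", "timeout" —
// microtasks run at the end of the current macro task, before the next macro task starts.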

Therefore, now that we understand the tasks in the rendering thread, it is not difficult to find that since rendering itself is also a task, it must be sequential in the queue with js tasks and other tasks, and needs to be executed one by one. This is how blocking occurs. Let's take a look at the blocking relationship between various resources.

To give a typical example of js blocking rendering, you can create an html file yourself and try it:

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <script>
      const endTime = Date.now() + 3000;
      while (Date.now() <= endTime) {}
    </script>
    <div>This is page</div>
  </body>
</html>

The rendering thread first executes the Parse HTML task; while parsing the DOM it encounters the script, so Evaluate Script runs. The code executes for 3 seconds before it finishes, and only then does parsing and rendering continue with the following <div>This is page</div>, so the page takes 3 seconds to appear. If the script is a remote resource, the request itself also blocks the DOM parsing and rendering below it.

We can optimize it through the defer attribute of the script. Defer will delay the execution time of the script until after the DOM is parsed and before the DOMContentLoaded event.

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <script defer src="xxx.very_slow.js"></script>
    <div>This is page</div>
  </body>
</html>

This way no time is wasted waiting for the request, and it also guarantees that the DOM has been parsed when the js runs, so it is safer to query elements; multiple defer scripts also keep their original execution order. Alternatively you can get a similar effect by simply placing the script at the bottom of the page. Don't worry that a script request written at the bottom is issued late: browsers generally have an optimization that scans the HTML for resource requests in advance and pre-requests them as soon as document parsing begins.

script also has another attribute, async. While the js resource is still being requested, parsing of the following content is not blocked; the js executes immediately once the request finishes. Therefore its execution timing is not fixed, it depends on when the request ends, and the execution order of multiple async scripts is not guaranteed.

Will css block rendering and js?

Just remember one conclusion here. The request and parsing of css will not block the parsing of the dom below, but it will block the rendering of the render tree and the execution of js.

As for why it is designed like this:

The render tree is blocked because it is by definition the product of applying the cascading style sheets to the DOM tree, so it has to wait for the CSS. It could have been designed not to wait: you could render a bare DOM tree first and render the complete render tree later, but rendering twice is wasteful and the user experience of a bare, unstyled DOM tree is poor.

The reason js is blocked by css is probably that styles can be modified in js. If the later js executed first and modified a style, and the earlier css were applied afterwards, the resulting styles would not match the order the code was written in; the only fix would be to re-apply the styles set by js, i.e. render twice, which is wasteful. Also, js can read element styles; if the js below executed before the css request finished parsing, the styles it reads would not match reality.

So in summary, although css will not directly block the parsing of dom, it will block the rendering of the render tree, and indirectly block the parsing of dom by blocking the execution of js.

If you are interested, you can build a node service yourself to experiment. By controlling the resource response time, you can test the mutual influence of various resources.
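
If you want to try it, a hypothetical Node.js server like the one below (paths and delays are made up) lets you control the response time of each resource and observe how they block one another:

const http = require("http");

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

http
  .createServer(async (req, res) => {
    if (req.url === "/slow.css") {
      await delay(3000); // make the stylesheet slow and watch what it blocks
      res.writeHead(200, { "Content-Type": "text/css" });
      res.end("div { color: red; }");
    } else if (req.url === "/slow.js") {
      await delay(3000); // or make the script slow instead
      res.writeHead(200, { "Content-Type": "application/javascript" });
      res.end("console.log('slow js loaded');");
    } else {
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end(
        "<link rel='stylesheet' href='/slow.css'>" +
          "<script src='/slow.js'></script>" +
          "<div>This is page</div>"
      );
    }
  })
  .listen(8080);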

Why does browser rendering take so long? What is the rendering pipeline

Rendering an html into a page generally requires the following steps:

  • Generate the DOM tree: when the browser gets the HTML, what does it actually do before the page appears? First it pre-scans all resource requests inside and issues pre-requests. Then the HTML is parsed lexically and grammatically; when it encounters element tags such as <body> and <div> and attributes such as class and id, it parses them and builds the DOM tree. During this it may encounter css and js, such as <style>, <link>, and <script>. A css resource request does not block DOM tree parsing, but if the DOM tree finishes parsing before the css does, the creation of the render tree and the subsequent layout tree is blocked until the css is ready. If js is encountered, whether inline code or a resource request, DOM parsing waits until it has all finished executing, unless the script tag has the async or defer attribute. If there is a css resource before the js, the js will not execute until the css has been requested and parsed, which is how css indirectly blocks DOM parsing. When css-related code is encountered, the next step is to parse the css into a stylesheet.
  • Generate the stylesheet: the css also goes through lexical and grammatical analysis, and some of its values are normalized. What is normalization? For example, font-weight: bold and flex-flow: column nowrap that you wrote are shorthands rather than values the engine works with directly; they need to be converted into values the engine understands: font-weight: 700, flex-direction: column, flex-wrap: nowrap. Finally the serialized text becomes a structured stylesheet.
  • Generate Render Tree: With the DOM tree and stylesheet, you can add styles to the corresponding DOM through inheritance, CSS selector priority and other style rules. The CSS selector will match the conditions from right to left, so that the number of matches is relatively minimal, and eventually Form a render tree with styles.
  • Layout: Some DOM are not displayed, such as display: none, so a layout tree will be formed based on the render tree, which only contains nodes that will appear in the future to avoid invalid calculations. At the same time, the layout stage will calculate the layout position information of each element. This is time-consuming, and the positions of elements will affect each other.
  • Layer: Then different layers will be formed according to some special styles, such as position: absolute, transform, opacity, etc. The root node and scroll will also be counted as one layer. Because different layer layouts generally do not affect each other, layering can reduce layout costs in subsequent updates, and also facilitate subsequent composite layers to perform special transformations on individual layers.
  • Paint (drawing): This is not actually drawing to the display, but generating its own drawing commands for each layer. These commands are basic commands for GPU drawing, such as drawing a straight line, etc.
  • Composite: from this step the work is no longer executed on the rendering main thread but handed to the compositor and GPU, so even if js blocks the main thread it does not affect this part; CSS hardware acceleration also happens here. The draw-command list from the paint stage is handed to the compositor, which divides the area near the current viewport into tiles (in units of roughly 512px) and rasterizes the tiles near the viewport first; areas far from the viewport can wait until things are idle. The compositor passes the draw commands to the GPU through a rasterization thread pool, which draws and outputs bitmaps. Since these bitmaps belong to individual layers, the compositor then composites the layers, stacking them in the correct order to form the final image. This is usually done on the GPU, reducing CPU work and improving rendering performance.

Explain rasterization: Rasterization is a concept in computer graphics. In the composition layer, convert vector graphics, text, pictures and other elements in the layer into bitmap or raster images. This allows these elements to be rendered and displayed faster because bitmaps are processed more efficiently in the graphics hardware. The composition layer can draw the content that needs to be rendered into an off-screen memory area instead of rendering it directly on the screen. This avoids performance issues caused by drawing directly on the screen and allows the browser to optimize off-screen content in the background. By rasterizing layer content, the browser can better take advantage of graphics hardware acceleration for rendering. Graphics processing units (GPUs) in modern computers and mobile devices can efficiently process bitmap images, providing smoother animations and faster rendering speeds.

  • Display: the monitor sends a vsync signal meaning the next frame is about to be displayed. The bitmap from the compositor is handed to the viz component in the browser process, which puts it into the back buffer. When the monitor shows the next frame, the front and back buffers are swapped and the latest page image is displayed. The requestAnimationFrame callback in js is also triggered off this vsync signal, since it tells us the next frame is about to be rendered; it is the same vertical-sync concept found in games.

During rendering, these steps are executed in sequence like a pipeline. If the pipeline starts execution from a certain step, it will inevitably execute all the following steps to the end.

Therefore, if the page update comes from modifying position/layout-related styles, the pipeline re-runs from the layout stage to recalculate layout; this is called reflow (also known as relayout). Since the positions of many elements must be recalculated and positions affect each other, this step is clearly expensive. All subsequent steps, such as paint and composite, also run, so a reflow is always accompanied by a repaint.

If the page update comes from modifying a style unrelated to position (such as background-color or color), the pipeline only re-runs from the paint stage, because the data the earlier stages depend on has not changed. The draw commands are regenerated and then rasterized and composited on the compositor; the whole process is still fast, so a simple repaint is much cheaper than a reflow.

How to use rendering principles to improve performance

The browser itself has some optimizations. For example, you do not have to worry about whether color: red; width: 120px causes repeated paints because of ordering, nor about performance collapsing when you modify styles several times in a row or append elements several times in a row. The browser does not start rendering immediately after each modification; it puts the updates into a pending queue and flushes them in batches after a certain number of modifications or a certain amount of time.

When we write code to update the page, the principle is to trigger as few rendering pipelines as possible. Starting from the paint stage will be much faster than the layout stage. Here are some common considerations:

  1. Avoid indirect reflow. In addition to directly modifying position-related styles, some situations may indirectly modify the layout. For example, if your box-sizing is not border-box, and the width is not fixed, then if you add or modify border-width, it will affect the width of the box model and the layout position. For example, <img /> does not specify a height, causing the height of the image to be raised after loading, causing the page to reflow.
  2. The principle of separation of reading and writing. js obtaining element position information may trigger forced reflow, such as getBoundingClientRect, offsetTop, etc. As mentioned earlier, the browser updates in batches and there is a waiting queue. Therefore, when you obtain the location information, the waiting queue may not be cleared due to updates, and the page may not be the latest. In order to ensure that the data you get is accurate, the browser will forcefully clear the queue and force the page to reflow. When you obtain the location information for the second time, and no update occurs during the period, the waiting queue is empty, and the reflow will not be triggered again. So if you want to batch modify the size of a batch of elements and obtain their size information, you must not write like this:
const elements = document.querySelectorAll(".target");
const count = 1000;
for (let i = 0; i < count; i++) {
  // Increase the element's width by 20px
  elements[i].style.width = parseInt(elements[i].style.width) + 20 + "px";
  // Read the element's latest width
  console.log(elements[i].getBoundingClientRect().width);
}

 After the above explanation of browser batch updates and forced reflow, it can be seen that writing this way is very problematic, and the page will reflow 1000 times! Because every time you modify style.width, the browser will put the update into the waiting queue. There is nothing wrong with this step. But then you start to get the width of this element, so in order to know the latest width, the browser will clear the waiting queue, skip the batch update, and force the page to reflow. Then keep doing this for 1000 times.

And if you write like this, 1000 size modifications will only reflow once:

const elements = document.querySelectorAll(".target");
const count = 1000;
for (let i = 0; i < count; i++) {
  // Increase the element's width by 20px
  elements[i].style.width = parseInt(elements[i].style.width) + 20 + "px";
}
for (let i = 0; i < count; i++) {
  // Read the element's latest width
  console.log(elements[i].getBoundingClientRect().width);
}

 css hardware acceleration

The calculations in the stages before compositing are basically done on the CPU. The CPU has far fewer compute units than the GPU: it is powerful for complex tasks but much slower than the GPU for simple, repetitive ones. If a page update can start directly at the compositing stage and be computed only by the GPU, it will inevitably be fast; this is hardware acceleration. So how can it be triggered?

CSS properties such as 3D transforms and opacity do not involve reflow or repaint, only layer transformations, so the earlier layout and paint stages are skipped and the work is handed directly to the compositor; the GPU only performs some simple transformations on the layer, which is trivial for it. There is also a CSS property called will-change dedicated to hardware acceleration: it tells the browser in advance which properties will change, so it can prepare ahead of time.

Note that modifying styles with js, even the hardware-accelerated styles mentioned above, still goes through the CPU. Remember the rendering pipeline? js can only modify the DOM tree, which necessarily triggers a DOM change, so the pipeline runs from the first step all the way to the end instead of starting directly at the compositing stage.

Since styles modified by js cannot be hardware accelerated, how else can you change them? Use non-js mechanisms such as animation or transition. You can experiment: while js completely blocks the page, check whether an animation keeps running.

How to measure performance and troubleshoot rendering jank

Judging whether a page is janky is not just a matter of trying it yourself and feeling that it is fine. Subjective impressions give no quantitative data; at work you must argue from data to convince others.

1. Developer metrics

If you want to check the performance of a page opened on your own machine, go to Lighthouse in DevTools.

 

Click the "Analyze page load" button and a performance report will be generated.

 

It contains performance metrics such as first paint time and time to interactive, plus accessibility checks, user experience checks, SEO checks, PWA (progressive web app) checks, and so on.

If you want to investigate what is causing rendering jank on your machine, go to the Performance panel in DevTools. It clearly lists each task by how much time it takes. You will see the term Long Task, which means a task that blocks the main thread for 50 ms or more. You can click a Long Task to see in detail what it did (in the example, querySelectorAll took too long and needed optimization).

2. Real user monitoring

The above is only suitable for ad hoc troubleshooting during development, and it only covers your own computer and network. It cannot tell you how the project performs after launch for users on different networks, devices, and geographic locations. So other means are needed to learn the real performance metrics.

To judge performance, you first need a clearly defined set of metric names. So which metrics are needed?

 

 How to get these indicators? Modern browsers generally have performance APIs, in which you can see a lot of detailed performance data. Although some of the above data are not directly available, you can calculate it through some basic APIs.

 

You can see that there are, for example: eventCounts - number of events, memory - memory usage, navigation - page opening method, number of redirects, timing - dns query time, tcp connection time, response time, dom parsing and rendering time, interactive time, etc.

In addition, performance also has some very useful APIs, such as performance.getEntries().

It will return an array listing all resources and the time spent at key moments. Among them are first-paint-FP and first-contentful-paint-FCP indicators. If you only want to find certain specific performance reports, you can use performance.getEntriesByName() and performance.getEntriesByType() to filter.
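
For example, FP and FCP can be read roughly like this (a sketch; "first-paint" and "first-contentful-paint" are the standard entry names):

const [fp] = performance.getEntriesByName("first-paint");
const [fcp] = performance.getEntriesByName("first-contentful-paint");

console.log("FP:", fp && fp.startTime, "ms");
console.log("FCP:", fcp && fcp.startTime, "ms");

// Or filter by type, e.g. all resource timing entries
const resources = performance.getEntriesByType("resource");
console.log("number of resource entries:", resources.length);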

Let’s talk about how the TTI (Time to Interactive time to interact) indicator should be calculated:

  1. First get the First Contentful Paint first content drawing (FCP) time, which can be obtained through performance.getEntries() above.
  2. Search for a quiet window with a duration of at least 5 seconds in the forward direction of the timeline, where the quiet window is defined as: no long tasks (long tasks, js blocks tasks for more than 50ms) and no more than two network GET requests being processed.
  3. Search the last long task before the quiet window in the reverse direction along the timeline. If no long task is found, execution stops at the FCP step.
  4. TTI is the end time of the last long task before the quiet window (same as the FCP value if no long task is found).

The tricky part may be how to obtain long tasks. The PerformanceObserver class can be used to observe performance data: add longtask to entryTypes to get long-task information, and add more types to get other metrics; see the documentation of this class for details. The following is an example of observing longtask:

const observer = new PerformanceObserver(function (list) {
  const perfEntries = list.getEntries();
  for (let i = 0; i < perfEntries.length; i++) {
    // Handle long-task notifications here:
    // e.g. report them for analysis and monitoring
    // ...
  }
});
// register observer for long task notifications
observer.observe({ entryTypes: ["longtask"] });
// Afterwards, if a long task runs, its timing data is put into the performance entry queue,
// so the observer receives "longtask" entries.

After you have written the code to collect the various performance metrics (or simply use an existing library), you can embed it in users' pages and report the data to a performance statistics backend when a user opens the page.
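
Reporting is often done with navigator.sendBeacon, since it keeps working even while the page is unloading (the endpoint URL and payload below are placeholders):

function reportMetrics(metrics) {
  const body = JSON.stringify(metrics);
  if (navigator.sendBeacon) {
    // sendBeacon queues the request in the browser and survives page unload
    navigator.sendBeacon("/perf-report", body); // hypothetical endpoint
  } else {
    // Fallback: a keepalive fetch
    fetch("/perf-report", { method: "POST", body, keepalive: true });
  }
}

window.addEventListener("pagehide", () => {
  reportMetrics({ fcp: 1234, tti: 2345 }); // example payload
});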

How to optimize js

There are many optimization angles for js, which need to be divided into different scenarios, so we have to talk about it from the perspectives of technology stack selection, multi-threading, and v8. 

Technology stack selection

1. Page rendering solution selection

1. CSR browser-side rendering: Nowadays, spa front-end frameworks such as react and vue are quite popular. State-driven spa applications can achieve rapid page switching. But the disadvantage that comes with it is that all the logic is in the js on the browser side, causing the first screen startup process to be too long.

2. SSR server-side rendering: the innate rendering flow of a SPA (CSR) is longer than server rendering: html request -> js request -> js execution -> js renders content -> after mounting the DOM, request the API -> update the content. Server-side rendering only needs: html request -> page content rendered -> js request -> js execution attaches events. In terms of when page content appears, server-side rendering is much faster than a SPA, which makes it well suited to scenarios where you want users to see the content as soon as possible.

You can use React's nextjs framework or Vue's nuxtjs framework to achieve isomorphic server-side rendering with the same code on the front end and back end. The essential enablers are the server-side runtime provided by nodejs and the multi-platform rendering capability provided by the virtual DOM abstraction layer: the same code can render both in the browser and on the server (the server renders HTML text), and secondary page navigations can still use the SPA approach, so you gain first-screen speed without losing SPA page-switching speed. The SSR performance of both major frameworks also keeps being optimized. For example, in React 18's SSR, the new renderToPipeableStream API can stream HTML and supports Suspense, so time-consuming tasks can be skipped and users see the main page faster. You can also use selective hydration: wrap components that do not need to load synchronously with lazy and Suspense (just like on the client) to optimize the main page's time to interactive, indirectly achieving code splitting in SSR.

3. SSG static page generation: For example, React's nextjs framework also supports generating static sites, running your components directly during packaging to generate the final HTML. Your static pages can be opened without runtime, achieving the ultimate opening speed.

4. App client-side rendering: If your front-end page is placed in the App, the client can implement the same mechanism as server-side rendering. At this time, opening the page in the App is similar to server-side rendering. Or a simpler approach is to put your front-end spa package in the client package, which can also be opened instantly. Their biggest speed-up point is actually that users also download front-end resources when installing the App.

2. Choice of front-end framework

In modern front-end development we generally choose a framework without hesitation. But if your project is not complex now and will not become complex, and you are chasing performance to the extreme, you may not need state-driven frameworks like react and vue at all. They give you the development convenience of updating the page just by modifying state, but improving DX (developer experience) has a performance cost. First, the extra runtime increases the amount of js. Second, because rendering is at best component-level, when state changes the whole corresponding component re-executes, and the resulting virtual DOM must go through a diff to keep browser rendering cheap. These extra steps mean it is certainly not as fast as using plain js or jquery to modify the DOM precisely. So if your project is simple and you want it fast and lightweight, you can implement it directly with js or jquery.

3. Optimization of the framework

If you choose React, you generally need some extra optimization during development. For example, use useMemo to cache data while dependencies stay unchanged, useCallback to cache functions while dependencies stay unchanged, and Class Component's shouldComponentUpdate to decide whether a component needs to update. Because React decides whether to update based on whether a variable's reference changes, two object or array literals with identical contents are still two different values; keep that in mind.
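
A small sketch of the idea (Item and the filtering logic are placeholders):

import { memo, useCallback, useMemo, useState } from "react";

const Item = memo(({ item, onSelect }) => (
  <li onClick={() => onSelect(item.id)}>{item.name}</li>
));

function List({ items }) {
  const [keyword, setKeyword] = useState("");

  // Recomputed only when items or keyword change, not on every render
  const visible = useMemo(
    () => items.filter((item) => item.name.includes(keyword)),
    [items, keyword]
  );

  // Stable reference, so the memoized Item components are not re-rendered needlessly
  const handleSelect = useCallback((id) => console.log("selected", id), []);

  return (
    <>
      <input value={keyword} onChange={(e) => setKeyword(e.target.value)} />
      <ul>
        {visible.map((item) => (
          <Item key={item.id} item={item} onSelect={handleSelect} />
        ))}
      </ul>
    </>
  );
}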

Also, if possible, try to keep using the latest version. Generally, new versions will optimize performance.

For example, React18 adds a task priority mechanism to prevent long tasks from blocking page interaction. Low-priority updates will be interrupted by high-priority updates (such as user clicks and inputs), and low-priority updates will continue until the high-priority updates are completed. In this way, users will feel that the response is timely when interacting. You can use useTransition and useDeferredValue to generate low-priority updates.
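
For instance, a low-priority update might look roughly like this (the list filtering is a stand-in for any expensive render):

import { useState, useTransition } from "react";

function Search({ allItems }) {
  const [keyword, setKeyword] = useState("");
  const [results, setResults] = useState(allItems);
  const [isPending, startTransition] = useTransition();

  const onChange = (e) => {
    const next = e.target.value;
    setKeyword(next); // high priority: keep the input responsive
    startTransition(() => {
      // low priority: may be interrupted by further keystrokes
      setResults(allItems.filter((item) => item.includes(next)));
    });
  };

  return (
    <>
      <input value={keyword} onChange={onChange} />
      {isPending ? <div>updating…</div> : <ul>{results.map((r) => <li key={r}>{r}</li>)}</ul>}
    </>
  );
}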

In addition, React18 also optimizes batch updates. In the past, batch updates were actually implemented through a lock mechanism, similar to:

lock();
// Locked: updates are only queued and not actually applied

update();
update();

unlock();
// Unlock: apply the queued updates in one batch

This limited batching to fixed places such as lifecycles, hooks, and React event handlers, because outside of React there is no lock. Moreover, if you used APIs such as setTimeout or ajax whose callbacks escape the current macro task, the updates inside them were not batched.

lock();

fetch("xxx").then(() => {
  update();
  update();
});

unlock();
// The updates have escaped the current macro task and only run after unlock; by then the lock is gone, so the two update calls make React render twice.

React 18's batched updates are designed around priorities instead, so they no longer need to happen in React-designated places in order to be batched.

4. Selection of framework ecology

In addition to the framework itself, its ecological selection will also have an impact on performance. The ecology of vue is generally relatively fixed, but the ecology of react is very rich. In order to pursue performance, you need to understand the characteristics and principles of different libraries. Here we mainly talk about global state management and selection of style solutions.

When choosing a state management library, react-redux may hit performance issues in extreme cases. Note that we are talking about react-redux, not redux: redux is just a generic library, it is very simple and can be used in many places, so one cannot really talk about its performance in isolation. react-redux is the library that lets React use redux. Because the redux state is a brand-new reference on every change, react-redux cannot know which state-dependent components need to update, so you need selectors to compare values before and after. Every selector of every component that relies on the global state has to run; if selector logic is heavy or there are many components, performance problems appear. You can try MobX: its basic principle is the same as Vue's, triggering updates through intercepted getters and setters on objects, so it naturally knows which component needs to update. In addition, zustand offers a great experience and is highly recommended; although it is also redux-based, it is very convenient to use, can be used outside of components, and requires little boilerplate.

Among style solutions, only css-in-js solutions may carry a runtime, such as styled-components and emotion. But not always: some css-in-js libraries remove the runtime when you do not compute styles dynamically from props. If your styles are computed from component props, the runtime is unavoidable: it computes the css while executing the component js and then inserts style tags for you. That brings two problems: the performance cost, and the fact that style rendering is delayed to js execution time. You can avoid this with solutions other than css-in-js, such as css, sass, less, stylus, and tailwind. The most recommended here is tailwind, mentioned earlier in the network-level optimizations: it not only has zero runtime, but thanks to atomization your styles are fully reused and your css resources stay very small.

js multithreading

We knew earlier that js tasks will block page rendering, but what if a long task is necessary for business? Such as large file hash. At this time, we can start another thread, let it run this long task and tell the main thread the final result. 

Web Worker 

// main.js
const myWorker = new Worker("worker.js");

// send data to the worker thread
myWorker.postMessage(value);

// receive the result computed by the worker
myWorker.onmessage = (e) => {
  const computeResult = e.data;
};

// worker.js
onmessage = (e) => {
  const receivedData = e.data;
  const result = compute(receivedData);
  postMessage(result);
};

A dedicated Web Worker can only be accessed by the context that created it, i.e. the page window that spawned it.

Shared Worker

Shared Worker can be accessed by multiple different windows, iframes, and workers.

// main.js
const myWorker = new SharedWorker("worker.js");

// communication with a SharedWorker goes through its port
myWorker.port.postMessage(value);

myWorker.port.onmessage = (e) => {
  const computeValue = e.data;
};

// worker.js
onconnect = (e) => {
  // each connecting page gets its own port
  const port = e.ports[0];

  port.onmessage = (e) => {
    const receivedData = e.data;
    const result = compute(receivedData);
    port.postMessage(result);
  };
};

About thread safety

Because a Web Worker's communication points with other threads are tightly controlled, it is actually hard to cause concurrency problems: it cannot access non-thread-safe components or the DOM, and data has to be passed in and out of the thread as serialized objects. You would have to work quite hard to create race conditions in your code.

Content security policy

Workers have their own execution context, separate from the context of the document that created them, so a Worker is not governed by the document's Content Security Policy. For example, suppose a document is served with this HTTP header:

Content-Security-Policy: script-src 'self'

This prevents all scripts in the page from using eval(). But if a Worker is created from those scripts, eval() can still be used inside the Worker thread. To control the Content-Security-Policy inside the Worker, you have to set it on the HTTP response that serves the Worker script. The one exception is a Worker whose origin is a globally unique identifier (such as a blob: URL): it inherits the Content-Security-Policy of the document that created it.
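For illustration, a sketch of serving the worker script with its own policy (Express is used here purely as a hypothetical example server; file names and paths are assumptions):

const path = require("path");
const express = require("express");
const app = express();

// Serve worker.js with its own Content-Security-Policy header,
// since the worker does not inherit the document's policy.
app.get("/worker.js", (req, res) => {
  res.setHeader("Content-Security-Policy", "script-src 'self'");
  res.sendFile(path.join(__dirname, "worker.js"));
});

app.listen(3000);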

data transfer

The data passed between the main thread and a Worker thread is copied rather than shared: objects are serialized before being sent and deserialized on receipt. Most browsers implement this copy with the structured clone algorithm.
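When the payload is large, even the copy can be costly. Transferable objects such as an ArrayBuffer can be handed over instead of cloned; a small sketch (reusing myWorker from the dedicated Worker example above):

const buffer = new ArrayBuffer(1024 * 1024 * 32); // 32 MB

// The second argument lists transferables: ownership of the buffer moves to
// the worker instead of the data being structured-cloned.
myWorker.postMessage(buffer, [buffer]);

console.log(buffer.byteLength); // 0, the main thread can no longer use it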

Adapt to V8 engine internal optimization

V8 compilation pipeline

  1. Prepare the environment: V8 first prepares the runtime environment for the code. This environment includes heap and stack space, the global execution context, the global scope, built-in functions, extension functions and objects provided by the host environment, and the message loop system. It initializes the global execution context and global scope. The execution context mainly contains the variable environment, the lexical environment, this, and the scope chain. Variables declared with var and function declarations are put into the variable environment; this happens before the code executes, which is why those declarations are hoisted. Variables declared with const and let are put into the lexical environment, which is a stack structure: entering and leaving a {} block pushes and pops entries, and entries popped off the stack are no longer accessible, which is why const and let are block-scoped.
  2. Construct an event loop system: the main thread needs to continuously read tasks from the task queue and execute them, so an event loop mechanism must be built.
  3. Generate bytecode: after V8 has prepared the runtime environment, it performs lexical and syntactic analysis (Parser) on the code and generates the AST and scope information. The AST and scope information are then fed to the interpreter, Ignition, which converts them into bytecode. Bytecode is a platform-independent intermediate code. The advantages of using bytecode are that it can later be compiled into optimized machine code, and caching bytecode uses far less memory than caching machine code. Parsing is also lazy: V8 does not compile all the code at once. When it encounters a function declaration, it does not immediately parse the function body; only the top-level code is turned into AST and bytecode right away.
  4. Execute bytecode: the interpreter in V8 can execute bytecode directly. In bytecode, the source is compiled into assembly-like instructions such as Ldar and Add, covering fetching instructions, decoding them, executing them, storing data, and so on. There are generally two kinds of interpreters: stack-based and register-based. Stack-based interpreters use a stack to hold function arguments, intermediate results, variables, etc., while register-based virtual machines hold arguments and intermediate results in registers. Most interpreters are stack-based, such as the Java virtual machine, the .NET virtual machine, and early versions of V8; the current V8 uses a register-based design.
  5. JIT just-in-time compilation: although bytecode can be executed directly, it is comparatively slow. To improve execution speed, V8 adds a monitor to the interpreter: while the bytecode runs, if a piece of code is found to execute repeatedly, the monitor marks it as hot code.

     When a piece of code is marked as hot, V8 hands its bytecode to the optimizing compiler, TurboFan. TurboFan compiles the bytecode into machine code and runs optimization passes over it, and the optimized machine code executes far more efficiently. The next time that code runs, V8 prefers the optimized machine code. This design is called JIT (just-in-time compilation).

     However, unlike static languages, JavaScript is a flexible dynamic language: the types of variables and the properties of objects can be modified at runtime, while the code produced by the optimizing compiler only targets fixed types. If a variable's type changes dynamically during execution, the optimized machine code becomes invalid. The optimizing compiler then has to de-optimize, falling back to the interpreter the next time the code runs, and this extra de-optimization step is slower than simply executing the bytecode in the first place.

From this compilation pipeline we can see that when JS executes the same piece of code repeatedly, JIT makes it very fast (on the same order as statically typed languages such as Java and C#). But the premise is that the types and object shapes involved must not change arbitrarily, as in the following code.

const count = 10000;
let value = "";
for (let i = 0; i < count; i++) {
  // value alternates between a string and a number, so the optimized
  // machine code for this loop keeps being invalidated
  value = i % 2 ? `${i}` : i;
  // do something...
}
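A version that keeps value a single type, so the optimized machine code stays valid (a sketch; what "do something" is depends on your actual code):

const count = 10000;
let value = "";
for (let i = 0; i < count; i++) {
  value = `${i}`; // always a string, so TurboFan's optimized code is kept
  // do something...
}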

Optimization of V8 engine storage objects

JS objects are stored in the heap. An object behaves much like a dictionary: strings serve as key names, any value can be stored under a key, and values are read and written by key name. However, V8 does not implement object storage purely as a dictionary, mainly for performance reasons: a dictionary is a non-linear data structure, and hash computation plus hash collisions make lookups slower than in sequentially stored structures. (A sequential storage structure occupies one contiguous block of memory, such as a linear list or an array; non-linear structures such as linked lists and trees generally occupy non-contiguous memory.) To improve storage and lookup efficiency, V8 adopts a more elaborate storage strategy.

An object's properties are divided into sorting properties and regular properties. Numeric properties are automatically sorted in ascending order, are called sorting properties, and are placed before all other properties of the object; string properties are regular properties and keep the order in which they were created.
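This ordering can be observed directly (a small sketch):

const obj = {};
obj.b = "b";
obj[1] = 1;
obj.a = "a";
obj[0] = 0;

// Numeric (sorting) properties come first in ascending order,
// then string (regular) properties in insertion order.
console.log(Object.keys(obj)); // ["0", "1", "b", "a"]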

Inside V8, to make storing and accessing these two kinds of properties efficient, two linear data structures are used to hold them separately: the hidden elements and properties structures.

When these two conditions are met (no new properties are added after the object is created, and no properties are deleted after it is created), V8 creates a hidden class for the object, and the object holds a map property pointing to it. The hidden class records the object's basic layout information, including two things: all the properties the object contains, and the offset of each property value relative to the object's starting address in memory. With that, reading a property no longer needs a whole series of lookups: the offset is fetched directly and the memory address computed.

But JS is a dynamic language and object properties can change. Adding new properties, deleting properties, or changing the type of a property value changes the object's shape, which forces V8 to rebuild hidden classes and hurts performance.

Therefore, unless necessary, avoid using the delete keyword to remove an object's properties, and avoid adding or reshaping properties after creation; it is best to settle everything when the object is declared. Also make sure that object literals meant to share a shape are declared identically:

// Bad: a and b are declared in a different order
const object1 = { a: 1, b: 2 };
const object2 = { b: 1, a: 2 };

// Good
const object1 = { a: 1, b: 2 };
const object2 = { a: 1, b: 2 };

 In the first version the two objects have different shapes, so they get different hidden classes and the hidden class cannot be reused.
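If a property really needs to "go away" in hot code, one common workaround is to assign undefined instead of using delete, so the object keeps its shape (a sketch; whether this helps depends on the engine and the code path):

const config = { a: 1, b: 2 };

// delete config.b;     // changes the shape, hidden class must be rebuilt
config.b = undefined;   // shape (and hidden class) stays the same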

When the properties of the same object are read multiple times, V8 will create an inline cache for it. For example this code:

const object = { a: 1, b: 2 };

const read = (object) => object.a;

for (let i = 0; i < 1000; i++) {
  read(object);
}

The normal process for reading object attributes is: find hidden classes -> find memory offsets -> get attribute values. V8 optimizes this process when read operations are performed multiple times.

Inline cache is referred to as IC. When V8 executes a function, it will observe some key intermediate data at the call site (CallSite) in the function, and then cache these data. When the function is executed again next time, V8 can directly use these intermediate data. This saves the process of obtaining these data again, so V8 can effectively improve the execution efficiency of some repetitive codes by using IC.

The inline cache maintains a feedback vector (FeedBack Vector) for each function. The feedback vector is composed of many items, each item is called a slot. In the above code, V8 will sequentially write the intermediate data executed by the read function into the slot of the feedback vector.

In the code above, return object.a is a call site, because it reads an object property. V8 allocates a slot for this call site in the feedback vector of read; each slot contains the slot index, the slot type, the slot state, the address of the hidden class (map), and the property's offset. The next time read is called and return object.a executes, V8 looks up the offset of a in the corresponding slot and reads object.a directly from memory, which is faster than going through the hidden class.

const object1 = { a: 1, b: 2 };
const object2 = { b: 4, a: 3 }; // different key order, so a different shape

const read = (object) => object.a;

for (let i = 0; i < 1000; i++) {
  read(object1);
  read(object2);
}

 If the code becomes like this, the two objects read in each iteration have different shapes, so their hidden classes differ too. When V8 reads the second object, it finds that the hidden class recorded in the slot is not the one of the object being read, so it appends the new hidden class and property offset to the slot. The slot now holds two hidden classes and offsets. Each time a property is read, V8 compares them one by one: if the hidden class of the object being read matches one in the slot, the corresponding offset is used; if there is no match, the new information is also appended to the slot.

  • If a slot contains only 1 hidden class, the state is called monomorphic;

  • If a slot contains 2 to 4 hidden classes, the state is called polymorphic;

  • If a slot contains more than 4 hidden classes, the state is called megamorphic.

Clearly, monomorphic call sites perform best, so in functions that run many times we should try not to mutate objects and not to read objects of many different shapes, in order to get better performance.
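Applied to the example above: if both objects are created with the same literal shape, the call site inside read stays monomorphic (a sketch):

// Same keys in the same order, so the same hidden class and a monomorphic IC
const object1 = { a: 1, b: 2 };
const object2 = { a: 3, b: 4 };

const read = (object) => object.a;

for (let i = 0; i < 1000; i++) {
  read(object1);
  read(object2);
}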

This connects to something else. Back when I was reading about React 17, the official explanation of "why _jsx instead of createElement" listed some drawbacks of createElement, one of which was that createElement is "highly polymorphic" and hard to optimize at the V8 level. If you have followed this article, you will understand what that sentence means: it is essentially the megamorphic state described above. createElement is called a great many times on a page, but the props and other arguments it receives differ each time, so many inline cache entries are generated; that is why it is described as "highly polymorphic (megamorphic)". (Admittedly _jsx does not seem to have fully solved this point either, but being aware of it at all is already impressive.)


It's finally over! This wasn't easy to put together, so please credit the source when reprinting, and give it a thumbs up!!!
