Distributed Caching in Large-Scale Website Architecture

Caching is the first means of optimizing website performance. In large websites, caches are often used to store hot data or application context. For example, the session server cluster mentioned earlier can be built on a distributed cache. A distributed cache can also hold hot data from the database, reducing the load on the database.

Distributed cache architectures generally fall into two types: synchronized distributed caches, represented by JBoss Cache, whose nodes keep their data in sync with each other; and non-communicating distributed caches, represented by Memcached, whose nodes never talk to each other.

JBoss Cache keeps the same data on every server in the cluster: when one server updates an entry, the update is synchronized to all the others. Because every server holds a full copy of the data, total cache capacity is limited to that of a single machine, and the cost of synchronization grows as servers are added. This scheme is therefore rarely used in large websites. JBoss Cache is usually deployed on the same machines as the application servers, as shown in the figure below:
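The synchronization cost is easy to see in a sketch. The following is a minimal, hypothetical illustration of a synchronized cache node, not the actual JBoss Cache API; the in-process `peers` list stands in for remote replication. Every write is applied locally and then pushed to every peer, which is why write cost grows with cluster size while total capacity stays that of a single server:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a synchronized (replicated) cache node.
// Every node ends up holding the same data, so total capacity is
// bounded by one server, and each write costs O(number of peers).
public class ReplicatedCache {
    private final Map<String, Object> store = new ConcurrentHashMap<>();
    private final List<ReplicatedCache> peers; // stand-in for remote nodes

    public ReplicatedCache(List<ReplicatedCache> peers) {
        this.peers = peers;
    }

    public void put(String key, Object value) {
        store.put(key, value);               // apply the write locally
        for (ReplicatedCache peer : peers) { // then synchronize to every other node
            peer.applyReplicated(key, value);
        }
    }

    void applyReplicated(String key, Object value) {
        store.put(key, value); // peers apply without re-broadcasting, avoiding loops
    }

    public Object get(String key) {
        return store.get(key); // any node can serve any key
    }
}
```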


Memcached adopts a non-communicating distributed architecture: each server caches different data. The application server locates the data through a routing algorithm, typically consistent hashing, and then accesses the server that holds it. Since the cache servers never communicate with each other, the cluster scales well. As shown in the figure below:


The following focuses on the routing algorithm used with Memcached. Since each server in a Memcached cluster caches different data, the application server must use a routing algorithm to determine which cache server holds a given key before accessing it. The most obvious routing algorithm is remainder (modulo) hashing, but when the cache cluster needs to be scaled out, remainder hashing breaks down: most keys map to different servers once the node count changes.
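As a minimal sketch (the `selectNode` helper is illustrative, not part of any Memcached client), remainder-hash routing looks like this:

```java
import java.util.List;

public class RemainderHashRouter {
    // Pick a cache server by taking the key's hash modulo the node count.
    // Simple, but almost every key remaps when the node count changes.
    public static <T> T selectNode(String key, List<T> nodes) {
        int hash = (key.hashCode() & 0x7fffffff) % nodes.size();
        return nodes.get(hash);
    }
}
```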

For example, suppose there were originally 3 nodes, with indices 0, 1, and 2. After adding one node there are 4 nodes, with indices 0, 1, 2, and 3. Of the 12 previously cached keys numbered 0 through 11, only 25% still route to their original node after the new node is added. In other words, most of the previously cached data becomes unavailable. In a large-scale website this is disastrous: the sudden drop in cache hit rate caused by the new node pushes the load onto the database and may even take the site down.
Therefore, the main design goal of a distributed cache's routing algorithm is to keep as much cached data as possible valid after nodes are added. The consistent hashing algorithm was designed for exactly this.
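The 25% figure in the example can be verified directly. This small demo routes keys 0 through 11 with k mod 3 and then k mod 4, and counts how many keys still land on their original node:

```java
public class RemapDemo {
    public static void main(String[] args) {
        int stillValid = 0;
        for (int key = 0; key < 12; key++) {
            // A key stays valid only if the old and new node indices agree.
            if (key % 3 == key % 4) {
                stillValid++;
            }
        }
        // Prints: 3 of 12 keys still hit (25%)
        System.out.printf("%d of 12 keys still hit (%d%%)%n",
                stillValid, stillValid * 100 / 12);
    }
}
```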

Consistent hashing algorithm:
First construct an integer ring of length 2^32 and place each cache server node on the ring according to the hash of its name (hash values are distributed over 0 to 2^32-1). To locate data, compute the hash of the key to be cached (also in the range 0 to 2^32-1), then move clockwise from that position on the ring to the nearest server node; that node is the target on which the data is cached. As shown in the figure:
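A minimal sketch of the ring lookup, using a sorted map as the ring. For simplicity the hash here is Java's `hashCode` masked to a non-negative int rather than the full 0 to 2^32-1 range described above, and real Memcached clients typically use a stronger hash (for example the MD5-based Ketama scheme):

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hash ring: nodes are placed on the ring by the
// hash of their name; a key is served by the first node found
// clockwise from the key's hash, wrapping around at the end.
// Assumes at least one node has been added before lookups.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Illustrative hash; production clients use stronger functions.
    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    public void addNode(String nodeName) {
        ring.put(hash(nodeName), nodeName);
    }

    public String nodeFor(String key) {
        // ceilingEntry finds the first node clockwise from the key's hash.
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        // Past the largest hash, wrap around to the first node on the ring.
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```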


The consistent hashing algorithm has a small problem. When a new node is added, by the rules of consistent hashing it relieves the load on only one of the existing servers. As shown in the figure below, the newly added Node3 relieves only Node1; it cannot relieve Node0 or Node2.


This load imbalance after adding a node can be solved with virtual nodes: virtualize each physical server as a group of virtual servers, and place the whole group on the hash ring. Consistent hashing then finds a virtual server, and the physical server behind it is the target node. As shown in the figure. In practice, a physical server is typically virtualized as about 150 virtual nodes; the exact number can be tuned, as in the sketch below.
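Extending the earlier ring sketch with virtual nodes is straightforward: each physical server is inserted at many positions on the ring (150 per server, following the figure's description; the `server + "#" + i` naming is an illustrative convention), so a newly added server takes a small slice of every existing server's range instead of splitting only one neighbor's:

```java
import java.util.Map;
import java.util.TreeMap;

// Consistent-hash ring with virtual nodes: each physical server is
// mapped to many ring positions, so load spreads evenly and a new
// server relieves all existing servers a little, not just one.
public class VirtualNodeRing {
    private static final int REPLICAS = 150; // virtual nodes per physical server
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff; // illustrative; real clients use stronger hashes
    }

    public void addServer(String server) {
        for (int i = 0; i < REPLICAS; i++) {
            // "server#i" names the i-th virtual node of this physical server.
            ring.put(hash(server + "#" + i), server);
        }
    }

    public String serverFor(String key) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```

With virtual nodes, adding Node3 to the three-server example would take roughly a quarter of the keyspace, drawn in small slices from Node0, Node1, and Node2 alike.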






