Sub-database and sub-table

Chapter 1 Introduction

With the widespread adoption of Internet applications, storing and accessing massive data has become a bottleneck in system design. For a large-scale Internet application, billions of page views (PVs) per day place an extremely high load on the database, causing serious problems for system stability and scalability. To improve website performance through data segmentation, horizontal scaling of the data layer has become the preferred approach for architects. The main techniques are:

    Horizontally splitting the database: reduces the load on a single machine and minimizes the damage caused by downtime;
    Load-balancing strategy: reduces the access load on a single machine and lowers the chance of downtime;
    Cluster solution: solves the single-point inaccessibility problem caused by a database going down;
    Read/write separation strategy: maximizes the speed and concurrency of reads in the application.

Chapter 2 Basic Principles and Concepts
What is data sharding

"Shard" The word means "fragment" in English, and as a technical term related to databases, it seems to be first seen in MMORPGs. "Sharding" is called "sharding". Sharding is not a function attached to a specific database software, but an abstract process on top of specific technical details. It is a solution for horizontal expansion (Scale Out, or horizontal expansion and outward expansion). The I/O capability limitation of a single-node database server solves the problem of database scalability. The data is horizontally distributed to different DBs or tables through a series of segmentation rules, and the specific DBs or tables that need to be queried are found through the corresponding DB routing or table routing rules to perform Query operations. "Sharding" usually means "horizontal slicing", which is the focus of this article. Next, let's take a simple example: Let's illustrate the logs in a Blog application. For example, the log article table has the following fields:

Faced with such a table, how do we split it? How do we distribute its data across tables in different databases? We can put all article rows with user_id 1~10000 into the article table in DB1, all article rows with user_id 10001~20000 into the article table in DB2, and so on up to DBn. In this way the article data is naturally divided among the databases, achieving the goal of data segmentation.

The next problem to solve is how to find the specific database. The answer is simple and obvious: since we used the distinguishing field user_id to split the databases, the routing process naturally depends on user_id as well. That is, once we know the user_id of a blog post, we apply the sharding rule to that user_id and locate the specific database. For example, if user_id is 234, the rule above locates DB1; if user_id is 12343, it locates DB2. Applying the sharding rules in reverse to reach a specific DB is what we call "DB routing".
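To make this concrete, here is a minimal sketch of range-based DB routing in Java, assuming 10,000 users per database and databases named DB1..DBn; the class and method names are illustrative, not taken from any particular framework:

```java
public class RangeDbRouter {
    private static final int USERS_PER_DB = 10_000;

    /** Maps a user_id to its database name, e.g. 234 -> "DB1", 12343 -> "DB2". */
    public static String route(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive");
        }
        // user_id 1..10000 -> DB1, 10001..20000 -> DB2, and so on.
        long dbIndex = (userId - 1) / USERS_PER_DB + 1;
        return "DB" + dbIndex;
    }

    public static void main(String[] args) {
        System.out.println(route(234));    // DB1
        System.out.println(route(12343));  // DB2
    }
}
```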

Usually we consciously design our databases according to normal form. A DB design that takes data segmentation into account, however, may violate the usual rules and constraints: to split, we have to keep redundant fields in the tables to serve as the distinguishing, or marker, fields for the sub-databases, such as the user_id field in the article example above (of course, that example does not illustrate the redundancy of user_id well, since user_id would be present even without splitting; we simply got it for free). Redundant fields do not appear only in sharding scenarios; in many large applications redundancy is necessary anyway. That touches on efficient DB design, which this article will not go into.
Why do we need data segmentation?

The above is a brief description and explanation of what data segmentation is. Readers may ask: why do we need it? Isn't a mature, stable database like Oracle enough to support the storage and querying of massive data? Why do we still need data slicing?

Indeed, Oracle's DB is very mature and stable, but not every company can afford the high licensing fees and the high-end hardware it requires. Just imagine usage costs of tens of millions per year plus a minicomputer worth tens of millions of yuan as the hardware platform: is that something an ordinary company can bear? And even if we could afford it, if there is a better option, a cheaper solution with better horizontal scalability, why wouldn't we choose it?

We know that every machine, however well configured, has its physical ceiling, so when our application reaches or far exceeds a single machine's limit, we have to enlist other machines or keep upgrading the hardware; the common solution is to scale out, adding more machines to share the pressure. We also have to consider whether, as the business keeps growing, our machine capacity can grow linearly to meet demand. Sharding readily spreads computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing power while avoiding a single point of failure, improving system availability, and providing good fault isolation.

In view of these factors, data segmentation is necessary. Using free MySQL and cheap servers, or even PCs, as a cluster, we can achieve the effect of a minicomputer plus a big commercial DB, cutting capital investment and operating costs substantially. Why not do it? So we choose Sharding, and we embrace Sharding.
How to do data segmentation

Data segmentation can be physical: data is distributed to different DB servers through a set of segmentation rules, and the specific database is reached through routing rules. Each request then faces not a single server but N servers, which reduces the load pressure on any single machine.

Data segmentation can also happen inside one database: data is distributed to different tables within a single database through a set of segmentation rules. For example, the article table can be divided into sub-tables such as article_001 and article_002, with the sub-tables together forming, logically, one complete article table. The purpose is actually very simple. Suppose the article table holds 50 million rows, and we need to insert a new row; after the insert completes, the database has to update the index for this table, and the overhead of maintaining an index over 50 million rows cannot be ignored. If instead we split the table into 100 sub-tables, article_001 through article_100, the 50 million rows are spread evenly so that each sub-table holds only 500,000 rows. After inserting into a table of only 500,000 rows, index-maintenance time drops by an order of magnitude, greatly improving the DB's runtime efficiency and concurrency. Nor do the benefits of splitting tables stop there: there are other obvious gains, such as shorter lock times on write operations. A table-routing sketch follows.
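As an illustration of table routing, here is a minimal sketch that maps a user_id onto one of the 100 sub-tables article_001..article_100; the modulo scheme and zero-padded suffix are assumptions made for the example:

```java
public class ArticleTableRouter {
    private static final int TABLE_COUNT = 100;

    /** Returns the sub-table name for a given user_id, e.g. 12343 -> "article_044". */
    public static String tableFor(long userId) {
        // Map the id onto 1..100 and zero-pad to match article_001 .. article_100.
        long suffix = (userId % TABLE_COUNT) + 1;
        return String.format("article_%03d", suffix);
    }

    public static void main(String[] args) {
        // The INSERT now targets a 500,000-row sub-table instead of the full
        // 50-million-row article table, so index maintenance stays cheap.
        String sql = "INSERT INTO " + tableFor(12343) + " (user_id, title) VALUES (?, ?)";
        System.out.println(sql);
    }
}
```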

In summary, sub-database reduces the load on a single machine, while sub-table improves the efficiency of data operations, especially write operations. So far, though, we have not touched on how to split. The next sections describe and explain the segmentation rules in detail.

As mentioned above, to achieve horizontal segmentation there must be a redundant field in every table to serve as the splitting basis and marker field. In a typical application we choose user_id, and on that basis there are the following three sharding methods and rules (other methods exist, of course):

(1) Partition by number segment

user_id values 1~1000 correspond to DB1, 1001~2000 to DB2, and so on;

Advantages: partial migration is possible

Disadvantages: uneven data distribution

(2) Partition by hash modulo

Hash the user_id (or use the value of user_id directly if it is numeric), then take the result modulo a specific number. For example, if the application needs to split into 4 databases, we take the hash value of user_id modulo 4, i.e. user_id % 4. Each operation then has four possible outcomes: a result of 1 corresponds to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. In this way the data is distributed evenly across the 4 DBs. A routing sketch follows the pros and cons below.

Advantages: Evenly distributed data

Disadvantages: data migration is troublesome, and data cannot be allocated according to machine capacity
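A minimal sketch of this rule, following the mapping given in the text (remainder 1 -> DB1, 2 -> DB2, 3 -> DB3, 0 -> DB4); falling back to hashCode() for non-numeric keys is an assumption:

```java
public class HashModRouter {
    private static final int DB_COUNT = 4;

    /** Numeric user_id: use the value directly, as the text suggests. */
    public static String route(long userId) {
        long mod = userId % DB_COUNT;
        return mod == 0 ? "DB" + DB_COUNT : "DB" + mod;
    }

    /** Non-numeric key: hash it first (an assumed choice for this sketch). */
    public static String route(String key) {
        int mod = Math.floorMod(key.hashCode(), DB_COUNT);
        return mod == 0 ? "DB" + DB_COUNT : "DB" + mod;
    }

    public static void main(String[] args) {
        System.out.println(route(234L));  // 234 % 4 = 2 -> DB2
    }
}
```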

(3) Store the mapping in an authentication database

That is, we set up a separate DB that stores nothing but the mapping between user_id and DB. Every time we access the database, we first query this mapping DB to obtain the concrete DB information, and only then can we perform the query we actually need. A sketch follows the pros and cons below.

Advantages: strong flexibility; an explicit one-to-one mapping between user and DB

Disadvantages: one extra query is needed before every query, which degrades performance considerably
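A minimal sketch of the lookup approach; the in-memory Map stands in for the real mapping database so the example stays self-contained, and the cache is an assumed (not prescribed) way to soften the extra-query cost:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LookupDbRouter {
    // In production this would be a (user_id, db_name) table in a dedicated DB;
    // here a map stands in so the sketch is self-contained.
    private final Map<Long, String> mappingDb = new ConcurrentHashMap<>();
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    /** e.g. assign a new user to the least-loaded DB at registration time. */
    public void register(long userId, String dbName) {
        mappingDb.put(userId, dbName);
    }

    /** One extra lookup per query; caching the result amortizes the cost. */
    public String route(long userId) {
        return cache.computeIfAbsent(userId, mappingDb::get);
    }

    public static void main(String[] args) {
        LookupDbRouter router = new LookupDbRouter();
        router.register(234L, "DB1");
        System.out.println(router.route(234L));  // DB1
    }
}
```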

The above are the three methods we generally choose in development; some complex projects may use a mixture of the three. Through the descriptions above we now have a basic understanding of the sharding rules. There will of course be better and more complete sharding methods, and we need to keep exploring and discovering them.
Chapter 3 Basic outline of the research of this topic

The distributed data scheme provides the following functions:

(1) Provide sharding rules and routing rules (RouteRule for short);

(2) Introduce the concept of a cluster (Group) to ensure high availability of data;

(3) Introduce a load-balancing policy (LoadBalancePolicy, LB for short);

(4) Introduce an availability-detection mechanism for cluster nodes, periodically probing each single-point machine to guarantee that the LB policy is carried out correctly and the system stays highly stable;

(5) Introduce read/write separation to improve the speed of data queries;

A data-layer design with only sub-databases and sub-tables is not yet complete. When we adopt a database segmentation scheme, N machines together make up one complete DB; if one machine goes down, only one Nth of the data becomes inaccessible. That is acceptable, and at least far better than the pre-split situation, where the entire DB would have been unreachable.

In ordinary applications, data becoming inaccessible because of such a machine failure is tolerable. But what if our system is a high-concurrency e-commerce site? The economic loss from a single node going down could be very serious. In other words, our scheme still has a problem: its fault tolerance cannot stand the test. Of course, problems always have solutions. We introduce the concept of a cluster, here called a Group: for each sub-database node we deploy multiple machines, each holding the same data. Normally these machines share the load; when one goes down, the load balancer distributes its load to the machines that remain up. In this way the fault-tolerance problem is solved.

As shown in the figure above, the entire data layer consists of three clusters, Group1, Group2, and Group3. These three clusters are the result of horizontal data segmentation, and together they form a DB containing the complete data. Each Group includes 1 Master (there can of course be multiple Masters) and N Slaves, and the data on these Masters and Slaves is kept consistent. For example, if one slave in Group1 goes down, two slaves remain usable. Such a model never leaves part of the data inaccessible unless every machine in an entire Group goes down at once, and the probability of that is very small (barring a power outage, it is unlikely to happen).

Before the cluster was introduced, our query process was roughly as follows: a request reaches the data layer, carrying the necessary sharding field (usually user_id); the data layer routes to a specific DB according to that field and performs the data operation within that DB.

That is the situation without clusters. What does it look like once clusters are introduced? The rules and policies on our router can in fact only route to a specific Group, that is, to a virtual Group rather than a particular physical server. The next job is to find a specific physical DB server on which to perform the actual data operation.

This step is where we introduce the concept of the load balancer (LB). The responsibility of the load balancer is to locate a specific DB server, and its rule is as follows: the load balancer analyzes the read/write characteristics of the current SQL; if it is a write operation, or an operation demanding strong real-time consistency, it assigns the query directly to the Master; if it is a read operation, it assigns it to a Slave through a load-balancing policy.
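A minimal sketch of this dispatch rule; the naive SELECT-prefix test and the type names (DbServer, and LoadBalancePolicy beyond the abbreviation given above) are illustrative assumptions:

```java
interface DbServer { String name(); }
interface LoadBalancePolicy { DbServer select(); }

public class ReadWriteDispatcher {
    private final DbServer master;
    private final LoadBalancePolicy lb;   // picks one Slave from the group

    public ReadWriteDispatcher(DbServer master, LoadBalancePolicy lb) {
        this.master = master;
        this.lb = lb;
    }

    /** Writes and strongly real-time reads go to the Master; other reads to a Slave. */
    public DbServer target(String sql, boolean needsRealtime) {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        return (isRead && !needsRealtime) ? lb.select() : master;
    }

    public static void main(String[] args) {
        DbServer master = () -> "master";
        DbServer slave = () -> "slave1";
        ReadWriteDispatcher d = new ReadWriteDispatcher(master, () -> slave);
        System.out.println(d.target("SELECT * FROM article_001", false).name());        // slave1
        System.out.println(d.target("INSERT INTO article_001 VALUES (1)", false).name()); // master
    }
}
```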

The main research direction for our load balancer is the load-distribution policy. Load balancing usually comes in two forms: random and weighted. Random load balancing is easy to understand: pick one slave at random from the N slaves. Such random balancing takes no account of machine capability; it assumes every machine performs the same. If that matches reality, the approach is defensible. But what if it does not? When the physical capability and configuration of each slave differ, performance-blind random balancing is quite unscientific: it puts unnecessary extra load on weaker machines, even risking downtime, while the high-performance database servers never exercise their full capacity. With this in mind we introduce weighted load balancing: through an interface in our system, each DB server is assigned a weight, and at run time the LB allocates load to each server in proportion to its weight within the cluster. Of course, introducing such a concept inevitably increases the system's complexity and maintenance burden; there are gains and there are losses, and there is no escaping that.
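A minimal sketch of weighted random selection; the server names and weights are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class WeightedRandomLB {
    private final List<String> slaves = new ArrayList<>();
    private final List<Integer> weights = new ArrayList<>();
    private int totalWeight = 0;

    public void addSlave(String name, int weight) {
        slaves.add(name);
        weights.add(weight);
        totalWeight += weight;
    }

    /** Picks a slave with probability weight / totalWeight. */
    public String select() {
        int r = ThreadLocalRandom.current().nextInt(totalWeight);
        for (int i = 0; i < slaves.size(); i++) {
            r -= weights.get(i);
            if (r < 0) return slaves.get(i);
        }
        return slaves.get(slaves.size() - 1); // unreachable with positive weights
    }

    public static void main(String[] args) {
        WeightedRandomLB lb = new WeightedRandomLB();
        lb.addSlave("slave1", 1);  // weaker machine
        lb.addSlave("slave2", 3);  // roughly 3x the capacity
        System.out.println(lb.select());
    }
}
```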

With sub-databases, clusters, and a load balancer, will everything be fine? Things are far from that simple. With these pieces in place we can basically guarantee that our data layer withstands heavy pressure, but such a design still cannot completely avoid the harm of a database going down. If slave2 in Group1 goes down and the system's LB does not know it, the situation is actually very dangerous: believing slave2 to be available, the LB will keep assigning load to it, and the client will naturally see errors or exceptions as its data operations fail.

This is very unfriendly! How do we solve such a problem? We introduce an availability-detection mechanism for cluster nodes, or alternatively an availability data-push mechanism. How do the two differ? First, the detection mechanism: as the name suggests, the data-layer client itself probes the availability of every database in the cluster from time to time; the implementation is to attempt a connection, or to attempt to reach the database port, either of which will do.

What about the data-push mechanism? This question is best discussed in a real application scenario. In general, if an application's database goes down, the DBA is certain to know about it. The DBA then pushes the database's current state, via a program, to the client, i.e. the application side of the distributed data layer, which updates a local list of DB states and tells the LB that this database node is unusable: please do not assign load to it. The detection mechanism is active monitoring, while the push mechanism is passive notification; each has its strengths, but both achieve the same effect. With either in place, the problem just hypothesized will not occur, or, if it does, its probability is minimized. A minimal probe sketch follows.
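A minimal sketch of the active detection side, assuming that a plain TCP connect to the database port counts as "alive"; the host, port, and 5-second interval are illustrative:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NodeAvailabilityMonitor {
    private final Map<String, Boolean> alive = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            });

    /** Probes host:port every 5 seconds and records the result for the LB. */
    public void watch(String host, int port) {
        timer.scheduleAtFixedRate(() -> {
            boolean up;
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 1000); // 1s timeout
                up = true;
            } catch (Exception e) {
                up = false;                      // connect failed: node is down
            }
            alive.put(host + ":" + port, up);    // the LB consults this table
        }, 0, 5, TimeUnit.SECONDS);
    }

    public boolean isAlive(String host, int port) {
        return alive.getOrDefault(host + ":" + port, false);
    }

    public static void main(String[] args) throws InterruptedException {
        NodeAvailabilityMonitor monitor = new NodeAvailabilityMonitor();
        monitor.watch("127.0.0.1", 3306);  // assumed MySQL host/port
        Thread.sleep(2000);
        System.out.println(monitor.isAlive("127.0.0.1", 3306));
    }
}
```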

We have not yet explained the Master and Slave mentioned above in any depth. A Group consists of 1 Master and N Slaves. Why? The Master bears the load of write operations, that is, all writes happen on the Master, while read operations are allocated to the Slaves. This greatly improves read efficiency. In typical Internet applications, data surveys have concluded that the read/write ratio is about 10:1, meaning the vast bulk of data operations are reads, which is exactly why we have multiple Slaves.

But why separate reads and writes? Developers familiar with DBs know that write operations involve locking, whether row locks, table locks, or block locks, all of which reduce system execution efficiency. Our separation concentrates write operations on one node while read operations run on the other N nodes, which effectively improves read efficiency and, from another angle, ensures the system's high availability.
Reprinted from http://zhengdl126.iteye.com/blog/419850
