[Translated] How Data Sharding Works in a Distributed SQL Database


Today, businesses of all sizes are embracing modern, high-performance, user-facing applications as part of their broader digital transformation strategies. These applications rely on a relational database management system (RDBMS) as their data infrastructure, and they must now support ever larger data and transaction volumes. In this scenario, however, a single RDBMS quickly becomes overloaded. Data sharding is one of the most common architectures for solving this problem, giving an RDBMS better performance and higher scalability. In this article, we will explore what sharding is, how sharding extends the benefits of a database, and several common sharding architectures. We will also look at how a distributed SQL database such as YugaByte DB implements data sharding.

What exactly is data sharding?

Sharding is the process of splitting a large table into smaller chunks of data, called shards, which are then distributed across multiple servers. A shard is a horizontal partition: each shard holds a subset of the full data set and is responsible for a portion of the overall workload. The central idea is that data that would otherwise be too large for a single machine is spread across a cluster of database nodes. Sharding is also called horizontal partitioning; the distinction between horizontal and vertical partitioning comes from the traditional tabular view of a database. A database can be partitioned vertically (distributing different columns of a table across nodes) or horizontally (distributing different rows across nodes).
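To make the distinction concrete, here is a minimal Python sketch (illustrative only; the toy table, its columns, and the even/odd split rule are all invented for this example) showing the same rows partitioned vertically by column and horizontally by row:

```python
# A toy "users" table: each row is (id, name, country).
rows = [
    (1, "alice", "US"),
    (2, "bob", "UK"),
    (3, "carol", "DE"),
    (4, "dave", "US"),
]

# Vertical partitioning: different COLUMNS go to different nodes.
node_a = [(r[0], r[1]) for r in rows]          # holds (id, name)
node_b = [(r[0], r[2]) for r in rows]          # holds (id, country)

# Horizontal partitioning (sharding): different ROWS go to different nodes.
shard_0 = [r for r in rows if r[0] % 2 == 0]   # even ids
shard_1 = [r for r in rows if r[0] % 2 == 1]   # odd ids

print(shard_0)  # rows 2 and 4
print(shard_1)  # rows 1 and 3
```

Each shard keeps the full row schema but only a subset of the rows, which is what lets every shard serve its share of the workload independently.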

Figure 1: Vertical and horizontal partitioning (Source: Medium)

Why shard a database?

As a business grows, applications that rely on a single RDBMS eventually hit a performance bottleneck. With limited CPU power, main memory, and storage, database performance deteriorates over time. In an unsharded database, the response times of read operations and routine maintenance tasks become extremely slow. When it comes to providing more resources for database operations, vertical scaling (also known as scaling up) has a series of drawbacks and eventually reaches a point where it does more harm than good.

Horizontal partitioning, on the other hand, means more compute resources for handling queries, so you get shorter response times and faster index builds. By continuously rebalancing data volume and workload across additional nodes, sharding also makes more efficient use of newly added hardware. Not only that, maintaining a fleet of smaller, inexpensive servers costs far less than maintaining one large server.

Besides addressing scalability, sharding also mitigates the impact of unplanned downtime. When an unsharded database server goes down, all data becomes inaccessible, which can be a disaster. Sharding solves this problem well: even if one or two nodes go down, the other nodes still hold the remaining shards and, as long as they sit in different fault domains, can keep serving reads and writes. Overall, sharding increases the storage capacity of the cluster, shortens processing times, and, compared with vertical scaling, provides higher availability at lower cost.

The hazards of manual sharding

For data-intensive applications, a fully automated sharding deployment with built-in shard management and load balancing delivers huge benefits. Unfortunately, monolithic databases such as Oracle, PostgreSQL, and MySQL, and even some newer distributed SQL databases such as Amazon Aurora, do not support automatic sharding. This means that if you want to keep using these databases, you must shard manually at the application layer, which greatly increases development complexity. Your application needs an extra layer of sharding code to know how data is distributed and where to find it. You also have to decide which sharding method to use, how many shards you will ultimately need, and how many nodes to run them on. And once your business changes, the sharding scheme and the shard key may have to change with it.

One of the major challenges of manual sharding is uneven shards. A disproportionate distribution of data leaves the shards unbalanced, which means some nodes may be overloaded while others sit idle. Because overloaded nodes can drag down overall response times and crash the service, we should avoid storing too much data in any single shard. The problem can also show up as data concentrated in a small number of shards, because low diversity in the shard key means the data is spread across very few shards. While this may be acceptable in development and test environments, it is unacceptable in production. Uneven data distribution, overloaded nodes, and too few shards can all lead to resource exhaustion for both the shards and the service.

Finally, manual sharding complicates operations. Backups must now run across multiple servers. Data migrations and schema changes must be coordinated much more carefully to ensure all shards share the same configuration. Without adequate optimization, database join operations spanning multiple servers become inefficient and difficult to implement.

Common automatic sharding architectures

Sharding has been around for a long time, and over the years many sharding architectures and implementations have been developed and deployed across a wide range of systems. In this section, we will discuss the three most common ones.

Hash-based sharding

Hash-based sharding uses the shard key to generate a hash value, and that hash value determines where the data is stored. Using a common hash algorithm such as ketama, the hash function can spread data evenly across servers, reducing the risk of overloading individual nodes. With this method, rows with similar shard keys are unlikely to land on the same shard, so this architecture is well suited to targeted, single-row data operations.
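As a minimal sketch of the idea (the key format and the simple hash-modulo placement are assumptions for illustration; a production system would typically use consistent hashing such as ketama so that adding nodes does not remap most keys):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a shard key to a shard by hashing it.

    The modulo placement here is a deliberately simplified stand-in
    for a production scheme like ketama/consistent hashing.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:4], "big")
    return hash_value % NUM_SHARDS

# Adjacent keys scatter across shards rather than clustering together.
for key in ["user:1001", "user:1002", "user:1003"]:
    print(key, "-> shard", shard_for(key))
```

Because placement depends only on the hash, lookups by exact key are cheap, but a scan over a range of keys must touch every shard.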

Figure 2: Hash-based sharding (Source: MongoDB documentation)

Range-based sharding

Range-based sharding divides data according to ranges of the shard key's values. Rows with similar shard keys are likely to fall within the same range, and therefore into the same shard. Each shard keeps the same schema as the original database. Sharding the data becomes very simple: it is as easy as determining which range a row belongs to and placing it in the corresponding shard.

Figure 3: Range-based sharding

Range-based sharding makes reading contiguous spans of data, i.e. range queries, more efficient. However, this method requires the user to preselect a shard key, and a poorly chosen shard key can overload some nodes.
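A minimal Python sketch of range placement (the split points and key values are hypothetical): each shard owns a contiguous key range, so locating a key is a binary search over the split points, and a range scan only touches the shards whose ranges it overlaps.

```python
import bisect

# Upper bounds (exclusive) of each shard's key range; hypothetical split points.
# Shard 0: keys < "g", shard 1: < "n", shard 2: < "t", shard 3: the rest.
SPLIT_POINTS = ["g", "n", "t"]

def shard_for(key: str) -> int:
    """Find the shard whose range contains `key` via binary search."""
    return bisect.bisect_right(SPLIT_POINTS, key)

# Similar keys land in the same shard...
print(shard_for("apple"), shard_for("banana"))  # both shard 0

def shards_for_range(lo: str, hi: str) -> list[int]:
    """A range scan visits only the contiguous run of shards it overlaps."""
    return list(range(shard_for(lo), shard_for(hi) + 1))

# ...so a scan over ["a", "h") touches only 2 of the 4 shards.
print(shards_for_range("a", "h"))
```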

A good rule of thumb is to choose a shard key with high cardinality and a low repetition rate; such keys are usually stable, neither growing nor shrinking over time. Without a properly chosen shard key, data is partitioned unevenly, and the shards whose data is accessed more frequently than the rest become bottlenecks under the larger workload.

The remedy for uneven shards is automated shard splitting and merging. If a shard grows too large, or one of its rows is accessed too frequently, it is best to split that large shard into finer shards and redistribute them evenly across the nodes. Likewise, when there are too many small shards, we can do the opposite and merge them.

Location-based sharding

In location-based sharding, data is assigned to shards according to a user-specified column whose values (such as geographic location) map each row to a shard in the corresponding region. For example, with clusters deployed in the US, UK, and Europe, we can shard a user table on the value of its Country_Code column and place each row in the right region in accordance with GDPR (the General Data Protection Regulation).
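A tiny sketch of that routing rule (the country-to-cluster mapping and cluster names are invented for this example; the fallback region is likewise an assumption):

```python
# Hypothetical mapping from a user row's Country_Code column to a
# regional cluster, e.g. to keep EU residents' data in the EU for GDPR.
REGION_FOR_COUNTRY = {
    "US": "us-cluster",
    "UK": "uk-cluster",
    "DE": "eu-cluster",
    "FR": "eu-cluster",
}

def cluster_for(country_code: str) -> str:
    """Route a row to its region's cluster; default to the EU cluster."""
    return REGION_FOR_COUNTRY.get(country_code, "eu-cluster")

print(cluster_for("US"))  # us-cluster
print(cluster_for("FR"))  # eu-cluster
```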

Sharding in YugaByte DB

YugaByte DB is a highly resilient, high-performance distributed SQL database with automatic sharding, inspired by Google Spanner. It currently supports hash-based sharding by default. The project is actively developed, and location-based and range-based sharding will be added by the end of this year. In YugaByte DB, each shard of data is called a tablet, and each tablet resides on a corresponding tablet server.

Hash-based sharding

For hash-based sharding, each table is assigned a hash space from 0x0000 to 0xFFFF (a two-byte range), which can accommodate roughly 64K tablets in very large data sets or clusters. Consider the table in Figure 4, which is split into 16 tablets: the full two-byte hash space is divided into 16 parts, each part corresponding to one tablet.
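The arithmetic of that split can be sketched as follows (an illustration of the 16-tablet example in the text, not YugaByte DB's actual implementation):

```python
TOTAL_HASH_SPACE = 0x10000   # two-byte hash space: 0x0000 .. 0xFFFF
NUM_TABLETS = 16
RANGE_PER_TABLET = TOTAL_HASH_SPACE // NUM_TABLETS   # 0x1000 values each

def tablet_for(hash_value: int) -> int:
    """Map a 16-bit hash value to one of 16 equal-sized tablet ranges."""
    assert 0 <= hash_value <= 0xFFFF
    return hash_value // RANGE_PER_TABLET

print(tablet_for(0x0000))  # tablet 0
print(tablet_for(0x1234))  # tablet 1
print(tablet_for(0xFFFF))  # tablet 15
```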

Figure 4: Hash-based sharding in YugaByte DB

During a write, the primary key is first converted into its internal key and the corresponding hash value, which determine the tablet in which the data will be stored (Figure 5).

Figure 5: Deciding which tablet to use in YugaByte DB

For example, as shown in Figure 6, suppose you want to insert a key k with value v into a table. First, the hash value of k is computed; then the database looks up the tablet for that hash and the tablet server that owns it. Finally, the request is forwarded directly to the appropriate server for processing.
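The end-to-end write path described above can be sketched in a few lines of Python (everything here is a toy model: the 16-bit hash function, the tablet-to-server placement, and the in-memory "storage" all stand in for YugaByte DB internals):

```python
NUM_TABLETS = 16
# Toy placement: which server leads each tablet (hypothetical layout).
TABLET_LEADERS = {t: f"server-{t % 4}" for t in range(NUM_TABLETS)}

def hash16(key: str) -> int:
    """Toy 16-bit hash standing in for the database's internal hash."""
    return sum(key.encode("utf-8")) * 31 % 0x10000

def write(key: str, value: str, storage: dict) -> str:
    """Hash the key, find its tablet, 'forward' the write to that server."""
    tablet = hash16(key) // (0x10000 // NUM_TABLETS)
    server = TABLET_LEADERS[tablet]
    storage.setdefault(server, {})[key] = value
    return server

store: dict = {}
print(write("k", "v", store))   # the server that received the write
```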

Figure 6: Storing value v under key k in YugaByte DB

Range-based sharding

SQL tables can set ascending or descending order on the first column of the primary key. This allows data to be stored in the preselected sort order in a single shard (i.e., tablet). The project team is currently developing dynamic tablet splitting (based on various criteria, such as range boundaries and load) as well as enhanced SQL syntax to explicitly specify ranges for these features.

Summary

Data sharding is a solution for building scalable commercial applications on large data sets. There are many data sharding architectures to choose from, each offering different capabilities. Before deciding which one to adopt, clearly list your project's requirements and expected load. Manual sharding should be avoided in most cases because it significantly complicates application logic. YugaByte DB is a distributed SQL database with automatic sharding; it currently supports hash-based sharding, with range-based and location-based sharding coming soon. You can check out this tutorial to learn more about automatic sharding in YugaByte DB.

What's next?

  • Compare in depth how YugaByte DB differs from CockroachDB, Google Cloud Spanner, and MongoDB.
  • Get started with YugaByte DB on macOS, Linux, Docker, or Kubernetes.
  • Contact us to learn about pricing or to arrange a technical consultation.





Origin: juejin.im/post/5d42867a6fb9a06ac76d915d