Database must-know series: data partitioning and sharding

Author: Zen and the Art of Computer Programming

1. Background introduction

Overview

With the rapid development of the Internet, the mobile Internet, and cloud computing, processing massive amounts of data has become a major challenge for today's enterprises. How to store, process, and quickly query that data is an important part of the architecture of large websites. Database partitioning and sharding are two key techniques for managing data at this scale. This article explains both in detail and relates them to practical scenarios.

Introduction to partitioning and sharding

Partition

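Partitioning divides the data of a table into separate groups, or sub-tables, according to business rules. Each partition stores and processes only its own data. This can improve read and write performance, because an operation only has to touch the partitions that hold the relevant rows. For example, splitting the historical data of an order table into monthly sub-tables by order date can make queries that filter on the date considerably more efficient. The number of partitions per table is limited, and the limit depends on the product: MySQL, for example, allowed at most 1024 partitions per table before version 5.6.7 and allows 8192 since then (for tables that do not use the NDB storage engine).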
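
The monthly split of the order table described above can be sketched in a few lines of Python. This is only an illustration: the function name `monthly_partition`, the base table name `orders`, and the `orders_YYYY_MM` naming scheme are assumptions, not something a particular database prescribes.

```python
from datetime import date


def monthly_partition(order_date: date, base_table: str = "orders") -> str:
    """Return the name of the monthly sub-table that holds this order date."""
    return f"{base_table}_{order_date.year:04d}_{order_date.month:02d}"


# Hypothetical sample rows; only the order date decides the target sub-table.
orders = [
    {"id": 1, "order_date": date(2023, 9, 30)},
    {"id": 2, "order_date": date(2023, 10, 1)},
    {"id": 3, "order_date": date(2023, 10, 15)},
]

# Group the rows by sub-table, the way a router in front of the tables would.
tables = {}
for order in orders:
    tables.setdefault(monthly_partition(order["order_date"]), []).append(order)

for name, rows in sorted(tables.items()):
    print(name, [row["id"] for row in rows])
# orders_2023_09 [1]
# orders_2023_10 [2, 3]
```

A query for October 2023 then only needs to read `orders_2023_10` instead of the whole order history.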

Sharding

Sharding splits a database horizontally and deploys each shard on a different server. This increases system capacity and, when each shard is also replicated, helps avoid single points of failure. Every record is assigned to a shard according to its shard key, so records with the same shard key always land on the same shard, and the hardware of every server is put to use. In general, a distributed database consists of multiple shards and can be scaled out (more servers) or scaled up (more powerful servers) to provide higher throughput and availability.
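
A minimal sketch of shard-key routing, assuming a simple "hash the key, take it modulo the number of shards" rule. The server names in `SHARDS` and the function `shard_for` are hypothetical, and real systems often use more elaborate schemes.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical server names


def shard_for(shard_key: str, shards=SHARDS) -> str:
    """Map a shard key to one shard using a stable hash."""
    # hashlib returns the same digest in every process, unlike the built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# The same shard key is always routed to the same shard.
for user_id in ["user-1001", "user-1002", "user-1001"]:
    print(user_id, "->", shard_for(user_id))
```

The choice of shard key matters more than the hash function: a key with few distinct values, or one that most queries do not include, leads to uneven shards or to queries that must visit every shard.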

Advantages and Disadvantages of Partitioning and Sharding

Advantages of Partitioning and Sharding

  • Data distribution: Partitioning divides data into multiple independent subsets that can be stored on different physical devices; sharding distributes data across multiple physical nodes, which reduces the pressure on any single node and provides horizontal scalability. Redundancy itself comes from replicating those subsets, not from splitting them.
  • Load balancing: Sharding spreads the load across multiple physical nodes, which improves the throughput and processing capacity of the overall system. If the application also needs read/write separation, primary-secondary replication can be added on top of each partition or shard.
  • Easy maintenance: When data changes, only the corresponding partition or shard has to be modified. The data in other partitions or shards is not affected, and no extra cost is incurred for it.
  • Improved availability: When one partition or shard fails, the others keep working, so the system as a whole stays available and only the data on the failed one is affected.

Disadvantages of Partitioning and Sharding

  • Data migration complexity: Introducing partitioning or sharding into an existing database means moving the existing data into the new layout, which can be very complex and time-consuming when the data volume is large. Partitioning and sharding also cannot replace index optimization and sound database design.
  • Choosing data distribution rules: Data can be divided by range, by the values of a specific field, or by an algorithm such as hashing, and the right choice is often not obvious. An appropriate distribution strategy has to be chosen based on the business and its access patterns.
  • Performance overhead: Every insert, update, and delete has to be routed to the right partition or shard, and an update that changes the partition or shard key moves the row from one to another. This routing and movement adds overhead. Transactions that span multiple shards are also harder to support.
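
Two of the distribution rules mentioned above, dividing by range and dividing by the values of a specific field, can be sketched as follows. The bounds, partition names, and region values are made up for illustration.

```python
from bisect import bisect_right

# Range rule: exclusive upper bounds of each partition, in ascending order.
RANGE_BOUNDS = [1_000, 10_000, 100_000]    # hypothetical order-id bounds
RANGE_NAMES = ["p0", "p1", "p2", "p_max"]  # p_max catches everything above

# List rule: explicit value-to-partition mapping on one field.
REGION_TO_PARTITION = {"north": "p_north", "south": "p_south"}


def by_range(order_id: int) -> str:
    """Pick the partition whose range contains order_id."""
    return RANGE_NAMES[bisect_right(RANGE_BOUNDS, order_id)]


def by_list(region: str) -> str:
    """Pick the partition assigned to this region, with a fallback."""
    return REGION_TO_PARTITION.get(region, "p_other")


print(by_range(999), by_range(1_000), by_range(250_000))  # p0 p1 p_max
print(by_list("north"), by_list("west"))                  # p_north p_other
```

Range rules keep neighboring values together, which suits range queries but can concentrate new writes in the last partition. List rules give explicit control but need a fallback for values nobody planned for.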

Applicable scenarios

Database partitioning and sharding mainly apply in four scenarios:

  • Dividing data into multiple subsets according to business rules: This is most common in relational databases, for example the partitioning feature in MySQL. With range partitioning, rows are divided into several partitions according to the value of the partition column, and indexes are maintained within each partition, which gives good performance for range queries and similar queries on that column. Partitions can also be used to tier data, for example by keeping hot data in one partition and cold data in another to reduce disk I/O pressure.
  • Horizontal splitting of data: This scenario is common in search engines, distributed file systems, cache systems, and similar software. All of them can distribute data to different machines to improve the performance of the overall system. For example, Baidu's search engine divides search result data into multiple subsets by domain and stores them in different data centers to reduce network latency and response time and to improve the user experience.
  • Distributing data to different servers: MongoDB, for example, provides sharding, which distributes the data of a cluster across different servers to achieve horizontal scaling. Apache Hadoop likewise includes the distributed file system HDFS, which spreads file data over multiple servers and replicates it for fault tolerance and reliable storage.
  • Simulating a distributed database on a single physical node: Partitioning on a single node can improve performance, but that node remains a single point of failure, and partitioning alone does not keep multiple copies of the data. A database system based on a master-slave replication architecture can be used alongside it to mitigate the single point of failure.

Basic principles of partitioning and sharding

Partitioning principle

What is a partition?

Partitioning divides data into different blocks or subsets, making queries and operations faster, easier to control, and achieving high availability and scalability.

Why partition?

When a single table holds too much data, the query and write performance of the database suffers. To solve this, the data can be divided into multiple blocks, each of which stores and processes only its own rows, so that a query or write only has to touch the blocks it actually needs. Partitioning also makes the system easier to scale: processing capacity can be increased by adding partitions and reduced again by removing partitions that are no longer needed.
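
The idea that a query only touches the blocks it needs is often called partition pruning. A minimal sketch, assuming a hypothetical order table that is split into monthly partitions named `orders_YYYY_MM`:

```python
from datetime import date


def month_partitions(start: date, end: date) -> list:
    """List the monthly partitions that overlap the closed range [start, end]."""
    if start > end:
        return []
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"orders_{year:04d}_{month:02d}")
        # Step to the next month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# A query for orders between 2023-11-20 and 2024-01-05 touches three partitions.
print(month_partitions(date(2023, 11, 20), date(2024, 1, 5)))
# ['orders_2023_11', 'orders_2023_12', 'orders_2024_01']
```

Every other month of history is skipped entirely. A query that does not filter on the order date gets no such benefit and still has to scan all partitions.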

Advantages of partitioning

  • Improved query performance: Partitioning divides the data into multiple independent, smaller sets, which reduces the time spent scanning the table and speeds up queries.
  • Data can be spread across disks: Partitions can be stored on multiple disks, so a slow or failed disk affects only the partitions it holds. Keeping redundant copies still requires replication or backups.
  • Improved availability: When one partition fails, the other partitions keep operating normally, which helps keep the database available.

Disadvantages of partitioning

  • Creating a partitioned table takes a long time: Converting an existing table into a partitioned table requires rebuilding it, which takes time on large tables and can affect the availability of the database while it runs.
  • Harder query analysis: The amount of data in each partition can be uneven and hard to predict, for simple and complex queries alike, which makes it difficult to allocate resources sensibly. Partitioned tables can also take up more storage space when the data volume is large.

Sharding principle

What is sharding?

Sharding distributes data to multiple nodes.

Why sharding?

The performance of a stand-alone database is bounded by the CPU and memory of one machine. As the amount of data grows, that single machine eventually cannot keep up. To improve processing performance, the data is distributed across multiple computers, each of which handles only its own portion. This is called sharding.

Advantages of sharding

  • Improved system performance: Sharding distributes data across multiple computers, so the hardware resources of all of them are used in parallel, which improves processing performance.
  • Convenient horizontal expansion: With a suitable distribution scheme, adding a node requires only configuration changes and the movement of a portion of the data, not a repartitioning of the entire database, so the database can be expanded while it stays online.
  • Fault isolation: When one computer fails, the services on the other nodes are not affected, which helps keep the database available.
  • More flexible data distribution: With sharding, data can be moved between nodes dynamically and the distribution adjusted according to load, so that the load is spread evenly across the nodes.
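
Whether a new node can be added without reshuffling the entire database depends on the distribution scheme. One scheme with that property is consistent hashing, which the article does not name; the sketch below is an illustration under that assumption, with hypothetical node names and a hypothetical `HashRing` class.

```python
import hashlib
from bisect import bisect_right


def _hash(value: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []    # sorted list of (point, node)
        self._points = []  # the points alone, for bisect
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many points on the ring to even out the distribution.
        self._ring.extend((_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"ring:  {len(moved) / len(keys):.0%} of keys moved")

# For comparison: plain hash-mod-N when N goes from 3 to 4.
mod_moved = sum(_hash(k) % 3 != _hash(k) % 4 for k in keys)
print(f"mod N: {mod_moved / len(keys):.0%} of keys moved")
```

On the ring, the only keys that move are the ones the new node takes over, roughly a quarter of them when going from three nodes to four. With plain modulo routing, most keys change shards.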

Disadvantages of sharding

  • Sharding requires choosing data distribution rules: A shard key has to be chosen, which requires weighing business logic and the distribution of the data.
  • Sharding increases complexity and development difficulty: Because the data is spread over several nodes, inserts, updates, and deletes that touch more than one shard need extra synchronization to stay consistent, which makes development harder.
  • Applications need special handling: Applications have to be adapted to sharding in order to access data correctly.
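
What "adapting the application" means can be shown with a toy example. In this sketch each shard is modeled as an in-memory dict, and all names (`insert_order`, `orders_for_user`, `orders_over`) are hypothetical. It shows how a query that includes the shard key goes to one shard, while a query without it has to ask every shard and merge the results.

```python
import hashlib

NUM_SHARDS = 3
# Each shard is modeled as an in-memory dict: user_id -> list of orders.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_index(user_id: str) -> int:
    """Stable hash of the shard key, reduced to a shard number."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def insert_order(user_id: str, order: dict) -> None:
    """Writes go to exactly one shard, chosen by the shard key."""
    shards[shard_index(user_id)].setdefault(user_id, []).append(order)


def orders_for_user(user_id: str) -> list:
    """A query that includes the shard key touches one shard."""
    return shards[shard_index(user_id)].get(user_id, [])


def orders_over(amount: float) -> list:
    """A query without the shard key must fan out to every shard and merge."""
    hits = []
    for shard in shards:
        for user_orders in shard.values():
            hits.extend(o for o in user_orders if o["amount"] > amount)
    return sorted(hits, key=lambda o: o["id"])


insert_order("alice", {"id": 1, "amount": 30.0})
insert_order("bob", {"id": 2, "amount": 120.0})
insert_order("alice", {"id": 3, "amount": 250.0})

print([o["id"] for o in orders_for_user("alice")])  # [1, 3]
print([o["id"] for o in orders_over(100.0)])        # [2, 3]
```

The fan-out query is where much of the added development effort goes: merging, sorting, and paginating results from several shards is work that a single database would otherwise do on its own.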

Summary of partitioning and sharding

Partitioning and sharding are two database design techniques that can effectively solve problems such as query, writing, and maintenance under large data volumes. After adopting partitioning and sharding, the storage space of the database can be further expanded, improving the reliability and performance of the system. However, too many partitions and shards will bring some additional complexity and management problems. Therefore, in actual use, reasonable design needs to be combined with business scenarios.

Origin blog.csdn.net/universsky2015/article/details/133594901