Horizontal scaling of the pika cluster: capacity and performance are no longer limited by a single machine

Background

Pika is a persistent, large-capacity Redis-compatible storage service. It supports most interfaces of the string, hash, list, zset, and set data structures (see the compatibility details) and solves the capacity bottleneck Redis hits when the data set is too large to fit in memory. Users can migrate from Redis to pika without modifying any code. Pika has good compatibility and stability: it runs on more than 3,000 instances inside 360, and its GitHub community exceeds 3.8K stars. Because the capacity of a single pika machine is limited by the capacity of a single disk, both 360's businesses and the community have an increasingly strong demand for a distributed pika cluster. We therefore built a native distributed pika cluster and released it in pika v3.4. Compared with the pika+codis solution, where codis's support for creating and managing pika slots is unfriendly and requires heavy intervention from operations staff, the native pika cluster does not require an additional codis-proxy module.

Cluster deployment structure

Taking a cluster of three pika nodes as an example, the cluster is deployed as follows:

  1. Deploy an Etcd cluster as the metadata store for the pika manager.
  2. Deploy a pika manager on each of the three physical machines and configure the Etcd service endpoints. Each pika manager registers with etcd and competes to become the leader. Only one pika manager in the cluster can be the leader and write cluster metadata to etcd.
  3. Deploy the pika nodes on the three physical machines, then add the pika node information to the pika manager.
  4. For load balancing, register the pika service ports with LVS.

Data distribution

To isolate data by business, the pika cluster introduces the concept of a table: different business data is stored in different tables. Business data is stored in the corresponding slot according to the hash value of the key. Each slot has multiple replicas, which together form a replication group. All slot replicas in a replication group share the same slot ID; one of them is the leader and the others are followers. To ensure data consistency, only the leader serves reads and writes. The pika manager can schedule and migrate slots so that data and read/write pressure are spread evenly across the whole cluster, ensuring that cluster resources are fully utilized and that the cluster can be scaled out or in according to business pressure and storage capacity needs.
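The key-to-slot rule can be sketched in a few lines. This is only an illustration of "the hash of the key decides the slot": the actual hash function pika's router uses may differ, and the slot count of 1024 is a made-up example.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// slotForKey maps a key to a slot index. CRC32 modulo the slot count is used
// here purely for illustration; pika's real routing hash may differ.
func slotForKey(key string, slotNum uint32) uint32 {
	return crc32.ChecksumIEEE([]byte(key)) % slotNum
}

func main() {
	const slotNum = 1024 // a hypothetical table created with 1024 slots
	for _, key := range []string{"user:1001", "order:42"} {
		fmt.Printf("%s -> slot %d\n", key, slotForKey(key, slotNum))
	}
}
```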

Pika uses rocksdb as its storage engine, and each slot creates its own rocksdb instances. Each slot supports reading and writing the 5 Redis data structures, so migrating data is very convenient: just migrate the slot. However, this also brings excessive resource consumption. The current pika creates 5 rocksdb instances by default when a slot is created, one per data structure. When a table contains a large number of slots, or many tables are created, a single pika node hosts many slots and therefore opens far too many rocksdb instances, occupying too many system resources. In subsequent versions, pika will, on the one hand, support creating only the data structures a business needs when a slot is created, and on the other hand continue to optimize the blackwidow interface layer to reduce rocksdb usage.
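A back-of-the-envelope sketch of how instances accumulate under the v3.4 default of 5 rocksdb instances per slot; the table and slot counts below are made up for illustration.

```go
package main

import "fmt"

// rocksdbInstances estimates how many rocksdb instances a single pika node
// opens, assuming one instance per data structure per slot.
func rocksdbInstances(tablesOnNode, slotsPerTable, structuresPerSlot int) int {
	return tablesOnNode * slotsPerTable * structuresPerSlot
}

func main() {
	// e.g. 4 tables, each placing 32 slots on this node, 5 data structures
	// per slot -> 640 rocksdb instances on one pika node.
	fmt.Println(rocksdbInstances(4, 32, 5))
}
```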

Data processing

  1. When a pika node receives a user request, the parsing layer parses the redis protocol and passes the parsed result to the router layer.
  2. The router finds the slot corresponding to the hash of the key and determines whether that slot is on the local node (see the routing sketch after this list).
  3. If the slot lives on another node, a task is created for the request and placed in a queue, and the request is forwarded to the peer node for processing. When the task receives the peer's result, it returns the response to the client.
  4. If the slot belongs to the local node, the request is processed locally and the response is returned to the client.
  5. For write requests processed locally, the binlog is first written through the replication manager module and asynchronously replicated to the other slot replicas; the process layer then writes to the leader slot's db according to the configured consistency requirements. Blackwidow is the interface encapsulation over rocksdb.
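The routing decision in steps 2-4 can be sketched as follows. The Go types here are hypothetical stand-ins; pika's real router is C++ code inside the node, and the messages returned are placeholders for actual request handling and forwarding.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// Request and Router are illustrative types only.
type Request struct{ Key, Cmd string }

type Router struct {
	slotNum    uint32
	localSlots map[uint32]bool // slots whose leader replica lives on this node
}

func (r *Router) Handle(req Request) string {
	slot := crc32.ChecksumIEEE([]byte(req.Key)) % r.slotNum
	if r.localSlots[slot] {
		// Step 4: the leader slot is local, so process the request here
		// and return the result to the client.
		return fmt.Sprintf("%s handled locally in slot %d", req.Cmd, slot)
	}
	// Step 3: queue a task and forward the request to the node that owns
	// the slot; when the peer replies, the task relays the response.
	return fmt.Sprintf("%s forwarded to the owner of slot %d", req.Cmd, slot)
}

func main() {
	r := &Router{slotNum: 1024, localSlots: map[uint32]bool{3: true, 7: true}}
	fmt.Println(r.Handle(Request{Key: "user:1001", Cmd: "GET"}))
}
```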

The proxy is embedded in pika and does not need to be deployed separately. Compared with redis cluster, the client does not need to be aware of the proxy at all; it simply uses the cluster as if it were a single machine. The service ports of the pika nodes can be mounted behind LVS to balance the load across the whole cluster.
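From the client's side this looks like talking to a single redis server behind an LVS virtual IP. The sketch below uses the go-redis client; the VIP, port, password, and table name are placeholders, and the exact form of pika's select-a-table command may differ from what is shown.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Placeholder address and credentials: the client only sees the LVS
	// virtual IP and speaks plain redis protocol, with no proxy logic.
	rdb := redis.NewClient(&redis.Options{
		Addr:     "10.0.0.100:9221", // LVS VIP in front of the pika nodes
		Password: "table1-password", // triggers AUTH on connect
	})
	defer rdb.Close()

	// Select the business table, then use the cluster like a single redis.
	if err := rdb.Do(ctx, "select", "table1").Err(); err != nil {
		panic(err)
	}
	if err := rdb.Set(ctx, "user:1001", "alice", 0).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "user:1001").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(val)
}
```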

Log replication

The replication manager module in pika is responsible for master-slave log synchronization. To stay compatible with redis, pika supports non-consistent log replication, in which the leader slot writes data directly to its db without waiting for acks from the follower slots. It also supports raft-style consistent log replication, in which acks from a majority of replicas are required before the write is applied to the db.

Non-consistent log replication

The processing flow in the non-consistent scenario is as follows (a sketch follows the list):

  1. The processing thread receives the client's request, takes the lock, writes the binlog, and applies the operation to the db.
  2. The processing thread returns the response to the client.
  3. An auxiliary thread sends BinlogSync requests to the follower slots to synchronize the log.
  4. The follower slots return BinlogSyncAck to report their synchronization progress.
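A toy sketch of this write path, with in-memory stand-ins for the binlog and db; pika's real implementation lives in its C++ replication manager and is only paraphrased here.

```go
package main

import (
	"fmt"
	"sync"
)

// leaderSlot is an illustrative stand-in for a leader slot replica.
type leaderSlot struct {
	mu     sync.Mutex
	binlog []string
	db     map[string]string
}

func (s *leaderSlot) handleWrite(key, val string) string {
	// Step 1: take the lock, append the binlog, and apply to the db.
	s.mu.Lock()
	s.binlog = append(s.binlog, key+"="+val)
	s.db[key] = val
	s.mu.Unlock()

	// Steps 3-4 run on an auxiliary thread: ship BinlogSync to the follower
	// slots and collect their BinlogSyncAck offsets. The client response
	// below does not wait for any of that.
	go func() { /* send BinlogSync to followers here */ }()

	// Step 2: reply to the client immediately.
	return "+OK"
}

func main() {
	s := &leaderSlot{db: map[string]string{}}
	fmt.Println(s.handleWrite("user:1001", "alice"))
}
```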

Consistent log replication

In the consistent log replication scenario:

  1. The processing thread writes the client request to the binlog file.
  2. The log is synchronized to the follower slots by sending BinlogSync requests.
  3. The follower slots return BinlogSyncAck to report their synchronization status.
  4. Once the acks from the followers reach a majority, the corresponding request is written to the db (see the sketch after this list).
  5. The response is returned to the client.
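The distinguishing part of this path is the quorum wait in step 4. The sketch below assumes a hypothetical follower count and ack channel; it only illustrates "apply to the db after a majority of BinlogSyncAck responses", not pika's actual replication code.

```go
package main

import (
	"fmt"
	"time"
)

// waitForMajority returns true once acks from followers, plus the leader
// itself, reach a majority of the replication group, or false on timeout.
func waitForMajority(acks <-chan string, followers int, timeout time.Duration) bool {
	needed := (followers+1)/2 + 1 // majority of (followers + leader)
	got := 1                      // the leader's own binlog write counts
	deadline := time.After(timeout)
	for got < needed {
		select {
		case <-acks: // a follower reported BinlogSyncAck
			got++
		case <-deadline:
			return false
		}
	}
	return true
}

func main() {
	acks := make(chan string, 2)
	// Simulate two followers acknowledging the binlog offset.
	acks <- "follower-1"
	acks <- "follower-2"
	if waitForMajority(acks, 2, time.Second) {
		fmt.Println("majority reached: apply the request to the db, then reply to the client")
	}
}
```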

Cluster metadata processing

On the basis of codis-dashboard, we developed the pika manager (PM for short), which serves as the global control node of the cluster and is used to deploy, schedule, and manage it. The metadata and routing information of the entire cluster are stored in the PM.

  • The ability to create multiple tables in one cluster has been added, making it easy to isolate business data by table.
  • The number of slots and replicas can be specified when a table is created, which lets operations staff size tables according to business scale and fault-tolerance requirements.
  • The concept of a group is logically changed to a replication group, so that the original process-level data and log replication becomes slot-level replication.
  • A password can be set when a table is created to isolate business usage. The client only needs to execute auth and select statements to authenticate and operate on the specified table.
  • Slot migration is supported, which makes it convenient to scale out or in according to business needs.
  • The sentinel module is integrated: the PM continuously sends heartbeats to the pika nodes in the cluster to monitor their liveness. When the PM finds that a leader slot is down, it automatically promotes the follower slot with the largest binlog offset to leader.
  • The storage backend supports writing metadata to etcd to ensure high availability of the metadata.
  • The pika manager becomes the leader by continuously competing for a lock in etcd, which achieves high availability of the pika manager itself (a sketch of this election follows the list).
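The "compete for the lock in etcd" pattern can be sketched with etcd's off-the-shelf election helper. The key prefix and PM identity below are made up for illustration, and the real pika manager may implement its campaign differently.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session keeps a lease alive for this PM instance; if the process
	// dies, the lease expires and another PM can win the election.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	election := concurrency.NewElection(sess, "/pika-manager/leader")
	// Campaign blocks until this PM instance becomes the leader; only the
	// leader is allowed to write cluster metadata to etcd.
	if err := election.Campaign(context.Background(), "pm-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this pika manager is now the leader")
}
```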

Postscript

The launch of the native pika cluster removes the single-machine disk capacity limit of pika and allows horizontal scaling according to business needs. Some gaps remain, such as the lack of raft-based automatic leader election inside the cluster, range-based data distribution, and a dashboard for monitoring information. We will address these issues one by one in subsequent releases.
