Xiaohongshu's KV storage architecture: trillions of records and cross-cloud multi-active deployment are no problem

RedKV is a distributed NoSQL KV storage system built on NVMe SSD and developed in-house by Xiaohongshu. It supports both centerless and centralized management architectures and is designed to meet the company's need for KV storage with real-time persistence. RedKV 1.0 uses the Gossip protocol for node management and has been deployed at scale across the company, with real-time QPS approaching 100 million per second and storage capacity at the petabyte level. RedKV 2.0 adopts a centralized shard-management architecture that supports global multi-cloud multi-replica deployment, online elastic scaling, remote disaster recovery, and second-level service switching.

Through layered optimization, RedKV's aggregate write throughput is on average 3x that of comparable open-source products and its read throughput 1.5x; compared with HBase, cost is reduced by nearly 40%. RedKV is partially compatible with the Redis protocol, and its main data types (string/hash/zset) support most of the company's online storage businesses, resolving the cost issues of the early Redis cluster deployment and the performance and stability issues of HBase. The data interoperability between RedKV and Hive data warehouses also provides a solution for offline data services.

1. Development history of Xiaohongshu storage

Xiaohongshu is a platform where young people record and share their lives through short videos, images, and text. Under the current business model, user portrait data and note data are used for risk control and content recommendation. The stored data has object-attribute characteristics and many dimensions; portrait data alone has reached tens of terabytes. Online businesses have very strict P99 latency requirements for accessing portrait and note data.

Before 2020, the company's main NoSQL storage products were Redis and HBase. With the rapid growth of DAU, these early storage solutions ran into the following challenges:

  • Redis Cluster is mainly suited to caching scenarios. Enabling real-time AOF persistence has a significant performance impact, and each node must mount an extra cloud disk to store the AOF. With a limited number of cluster nodes and limited storage capacity, setting the per-node data volume too high makes failover after a node failure take too long, while setting it too low means the cluster needs many nodes, and the Gossip protocol's stability requirements limit how many nodes a cluster can have. In addition, to absorb burst traffic the Redis cluster must reserve headroom at deployment time, which drives up cost.

  • HBase, as a NoSQL storage system with a complete ecosystem, also exhibits many performance and stability problems under high QPS: stability is hard to guarantee when ZooKeeper is under heavy load (node detection, service registration, etc. all depend on ZooKeeper); HBase writes data files and WAL logs directly to HDFS, so after a node failure, replaying the WAL on HDFS is slow; Java GC pauses can cause ZooKeeper to mistakenly kill a RegionServer and produce latency spikes; major compaction causes I/O surges and long-tail latency; constrained by the complexity of HDFS, black-box operation and maintenance is difficult for engineers; and in Xiaohongshu's practice, HBase latency under millions of QPS was not ideal, while storing core data in a large-memory model also drove up cost.

With continued business growth, open-source storage products could no longer meet the company's development needs. The company needed a stable, high-performance KV system to support internal business: on the one hand it had to meet the businesses' functional and performance requirements, and on the other it had to optimize cost.

2. Xiaohongshu’s business needs

1. High-QPS, low-latency read characteristics

1) Feature data storage scenario

The write bandwidth reaches tens of GB/s, requiring high real-time write performance and read performance.

2) Image caching scenario

The amount of data is large and low reading latency is required. A small amount of data loss in failure scenarios is acceptable.

3) High performance (P99 < 10ms)

  • Model data storage service: records user model-training data over a period of time, with very strict P99 latency requirements and a data volume of tens of TB.
  • Deduplication storage service: data volume of tens of TB, P99 < 10ms, P999 < 20ms.
  • Risk-control data storage service: QPS currently reaches tens of millions, P999 < 30ms.

2. Low-cost caching features

1) Benchmarking against Redis

Compatible with the Redis protocol; performance is somewhat lower than Redis, but resource cost is reduced by 50%+.

2) Typical scenarios

Advertising keyword storage and anti-fraud businesses, which have large data volumes but low QPS.

3. NoSQL storage characteristics

1) Benchmarking against HBase

  • Supports multi-version data, column storage with row-oriented retrieval, and other features; cost is 30%+ lower than HBase, and P99 latency improves by a factor of 6.
  • Supports KKV level TTL.
  • Strong consistency: RedKV 1.0 currently uses master-slave dual replicas. A write can be configured in synchronous mode so that it only succeeds after both replicas are written; reading and writing through the master then guarantees strong consistency. For scenarios with high write-performance requirements, asynchronous writing can be enabled: the write returns once the master succeeds, and the slave catches up through incremental synchronization.

2) Typical scenarios

  • Risk-control services: real-time queries have extremely strict P999 requirements. HBase could no longer meet them under tens of millions of QPS, and its latency jitter was large.
  • Portrait storage service: the data has many dimensions, many business parties need to read its fields, and it is latency-sensitive.


3. Architecture design

The overall architecture of RedKV has three layers. The access layer is compatible with the Redis protocol and supports community SDKs in various languages as well as the company's customized middleware versions; the access-proxy layer supports tens of millions of QPS of read/write capacity and scales out statelessly; the storage layer provides highly reliable read and write services. The RedKV 1.0 architecture is shown in Figure 1 below; the three layers are introduced in detail in the following sections.

Figure 1 Overall architecture of RedKV1.0 

1. Client access layer

After a RedKV cluster is deployed, it performs service discovery through the company's internal Service Mesh component and provides services to clients through it.

2. Proxy

The Proxy layer consists of stateless CorvusPlus processes. It is compatible with old Redis clients, and its scaling and upgrades are transparent to both clients and the back-end cluster. It supports multi-threading, I/O multiplexing, and port reuse. Compared with the open-source version, CorvusPlus adds self-protection and observability features and makes the following features configurable online:

  • Proxy rate limiting
  • Online data compression
  • Thread model optimization
  • Backup read to optimize the long tail
  • Big-key detection

1) Proxy rate limiting

Xiaohongshu currently has many business models and client behavior is unpredictable. Release mistakes, system issues, and network jitter can all trigger client retries, and the resulting burst QPS affects service stability. Under high QPS pressure, read/write timeouts at the Proxy lead to large numbers of retries, which can cause an avalanche. During peak business hours, the bandwidth of a single Proxy may exceed the machine's network limit, while the storage cluster can only guarantee stable, reliable service within limited resources. For such scenarios, we need to ensure that when traffic is overloaded, the Proxy and RedKV services are not overwhelmed and remain highly available.

Based on the above problems and goals, and unlike the native Redis Cluster mode, RedKV implements a token-bucket-based flow-control algorithm that supports multi-dimensional limits on connection count, bandwidth, and QPS. Under high QPS, Proxy rate limiting prevents avalanches, as shown in Figure 2; in high-bandwidth scenarios it also reduces latency, as shown in Figure 3.

Figure 2 Current limiting in avalanche scenarios

 Figure 3 Current limiting in large bandwidth scenarios
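As a rough illustration of the multi-dimensional token-bucket idea (this is a minimal sketch, not CorvusPlus code; the class names, limits, and burst sizes below are assumptions):

#include <algorithm>
#include <chrono>
#include <cstdint>

// Minimal token-bucket sketch. One bucket per limited dimension,
// e.g. one for QPS and one for bytes per second.
class TokenBucket {
  using Clock = std::chrono::steady_clock;

 public:
  TokenBucket(double rate_per_sec, double burst)
      : rate_(rate_per_sec), capacity_(burst), tokens_(burst), last_(Clock::now()) {}

  // Returns true if `cost` tokens are available; otherwise the proxy
  // should reject or queue the request.
  bool TryAcquire(double cost) {
    auto now = Clock::now();
    double elapsed = std::chrono::duration<double>(now - last_).count();
    last_ = now;
    tokens_ = std::min(capacity_, tokens_ + elapsed * rate_);
    if (tokens_ < cost) return false;
    tokens_ -= cost;
    return true;
  }

 private:
  double rate_;      // tokens refilled per second
  double capacity_;  // burst size
  double tokens_;
  Clock::time_point last_;
};

// A request is admitted only if every dimension still has budget
// (simplified: a real limiter would check all dimensions before consuming tokens).
struct ProxyLimiter {
  TokenBucket qps{50000, 5000};                                // assumed per-proxy QPS limit
  TokenBucket bandwidth{512.0 * 1024 * 1024, 64.0 * 1024 * 1024};  // assumed bytes/s limit

  bool Admit(uint64_t request_bytes) {
    return qps.TryAcquire(1) && bandwidth.TryAcquire(double(request_bytes));
  }
};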

2) Online data compression

The Proxy layer itself only does routing and forwarding and consumes very little CPU, so in high-bandwidth scenarios we can use the Proxy's spare CPU to optimize bandwidth and latency spikes. While parsing the Redis protocol, the LZ4 algorithm compresses written data online and decompresses it on read. In the recommendation-cache scenario, network bandwidth and storage space are reduced by more than 40% (Figure 4) with no significant increase in overall latency; because less data is transferred and written, latency spikes are also reduced.

Figure 4 Bandwidth optimization after Proxy compression is enabled
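A minimal sketch of the idea, assuming a simple 4-byte original-length header (the function names and on-disk format are illustrative, not RedKV's actual protocol):

#include <lz4.h>

#include <cstdint>
#include <cstring>
#include <string>

// Transparent value compression at the proxy (illustrative sketch).
std::string CompressValue(const std::string& raw) {
  int bound = LZ4_compressBound(static_cast<int>(raw.size()));
  std::string out(sizeof(uint32_t) + bound, '\0');
  uint32_t raw_len = static_cast<uint32_t>(raw.size());
  std::memcpy(&out[0], &raw_len, sizeof(raw_len));  // remember the original size
  int n = LZ4_compress_default(raw.data(), &out[sizeof(raw_len)],
                               static_cast<int>(raw.size()), bound);
  if (n <= 0) return raw;  // fall back (a real implementation would also flag uncompressed values)
  out.resize(sizeof(raw_len) + n);
  return out;
}

std::string DecompressValue(const std::string& stored) {
  uint32_t raw_len = 0;
  std::memcpy(&raw_len, stored.data(), sizeof(raw_len));
  std::string out(raw_len, '\0');
  LZ4_decompress_safe(stored.data() + sizeof(raw_len), &out[0],
                      static_cast<int>(stored.size() - sizeof(raw_len)),
                      static_cast<int>(raw_len));
  return out;
}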

3) Thread model optimization

The Proxy uses I/O multiplexing. Each connection maintains a request queue and a response queue, and responses are returned to the client in request order. Originally, after the Proxy received a response from a RedKV Server, it would wait until every outstanding command on the connection had returned before replying to the client. This is very unfriendly for read scenarios. After the change, once all commands ahead of a given command have already responded, that command's response is returned to the client immediately, without waiting for the remaining commands to complete.
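A minimal sketch of the per-connection ordering logic, with hypothetical types (the real proxy implementation differs): each connection keeps one slot per in-flight command, and the completed prefix is flushed immediately instead of waiting for the whole batch.

#include <cstddef>
#include <deque>
#include <optional>
#include <string>

// Per-connection response ordering (illustrative sketch).
class ConnectionResponses {
 public:
  // Called when a command is forwarded to the server; returns its slot id.
  size_t OnRequestSent() {
    pending_.emplace_back(std::nullopt);
    return base_ + pending_.size() - 1;
  }

  // Called when the server answers command `slot`; returns the responses
  // that can already be sent to the client, in the original request order.
  std::deque<std::string> OnResponse(size_t slot, std::string reply) {
    pending_[slot - base_] = std::move(reply);
    std::deque<std::string> ready;
    // Flush only the completed prefix: commands that are still pending do not
    // block the ones ahead of them that have already finished.
    while (!pending_.empty() && pending_.front().has_value()) {
      ready.push_back(std::move(*pending_.front()));
      pending_.pop_front();
      ++base_;
    }
    return ready;
  }

 private:
  std::deque<std::optional<std::string>> pending_;
  size_t base_ = 0;  // slot id of pending_.front()
};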

4) Backup read to optimize the long tail

In a public-cloud environment, a CVM virtual machine shares a physical host with other tenants' VMs. When another tenant's CVM consumes a lot of resources, it can easily affect our CVM's P99 latency (due to weak QoS and isolation, SMI interrupts, memory CE errors, etc.). Under high network throughput, one cloud vendor's DPDK layer can be saturated and cause OOB on the host. Inside RedKV, if a server is handling large requests or certain keys have high query latency, requests can queue up, or the block cache can be invalidated after a compaction, causing an I/O long tail. RedKV's P99 read-latency spikes are therefore hard to avoid entirely, but they occur only occasionally. Since our master and slave nodes are always deployed on different physical hosts, the probability of both spiking at the same time is very small. Based on this, we implemented a backup-read feature in the Proxy layer to optimize RedKV's P99 latency.

For the above model, our optimization approach is as follows (a short sketch follows):

  • Track node health and recent latency.
  • Send the request to whichever of the two replicas is currently in better shape.
  • Track the latency distribution; when a request exceeds the P95 latency, send a backup read for it to the other replica.
  • The request succeeds as soon as either of the two requests returns; on timeout, keep retrying.

Figure 5 Backup-read peak elimination 
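A simplified, synchronous sketch of the backup-read decision (the real proxy is asynchronous and event-driven; the Node type, its fields, and the polling loop below are assumptions):

#include <chrono>
#include <functional>
#include <future>
#include <string>

// Replica handle used by the sketch.
struct Node {
  std::function<std::string(const std::string&)> fetch;  // blocking RPC stub
  double recent_p95_ms = 5.0;  // maintained from a sliding latency histogram
  bool healthy = true;

  std::future<std::string> Get(const std::string& key) {
    return std::async(std::launch::async, fetch, key);
  }
};

std::string GetWithBackupRead(Node& primary, Node& secondary, const std::string& key) {
  // 1. Prefer the replica that currently looks healthier / faster.
  Node& first = (primary.healthy && primary.recent_p95_ms <= secondary.recent_p95_ms)
                    ? primary : secondary;
  Node& second = (&first == &primary) ? secondary : primary;

  auto fut1 = first.Get(key);
  // 2. Wait up to the first replica's P95 latency; if it has not answered by
  //    then, fire a backup read to the other replica.
  auto budget = std::chrono::duration<double, std::milli>(first.recent_p95_ms);
  if (fut1.wait_for(budget) == std::future_status::ready) return fut1.get();

  auto fut2 = second.Get(key);
  // 3. Return whichever replica answers first (simplified polling loop).
  while (true) {
    if (fut1.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return fut1.get();
    if (fut2.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return fut2.get();
  }
}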

Because backup-read forwarding does not copy memory (the request lifetime is managed through an index) and only requests whose latency exceeds P95 are considered for a backup read, at most about 5% of requests are sent twice, which adds essentially no pressure to the cluster. Figure 6 shows that on one cluster P999 dropped from 35ms to about 4ms, a very clear improvement. Compared with HBase in the same business scenario with the same client timeout configuration, our solution improves the client success rate.

5) Big-key detection

Many of our online clusters occasionally show latency spikes during business use, and packet captures showed that a large share of them were caused by big keys. To diagnose such problems, we added big-key observability metrics at the Proxy layer: while parsing a Redis command, the Proxy also measures the KV size. For a string read command, the key is judged to be a big key if the returned value exceeds big-string-size; for a write command, if the requested value exceeds big-string-size; for hash/zset, the judgment is based on the total number of KV pairs read. By adding read_size (total bytes returned by all read requests) and write_size (total bytes written by all write requests) metrics, rate(read_size) / rate(total_req_amount) gives the average request size. Big keys and hot keys are two unavoidable scenarios for a KV system: for big keys we provide the Proxy-layer data compression described above; for hot keys we do top-K statistics and handling on the Server layer based on the HeavyKeeper algorithm.

Figure 6 Backup-read P999 optimization comparison 
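A rough sketch of the big-key judgment and the request-size counters described above (thresholds and counter names are assumptions):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kBigStringSize = 64 * 1024;   // big-string-size, assumed value
constexpr size_t kBigCollectionElems = 5000;   // hash/zset element threshold, assumed

std::atomic<uint64_t> read_size{0};         // total bytes returned by reads
std::atomic<uint64_t> write_size{0};        // total bytes written by writes
std::atomic<uint64_t> total_req_amount{0};  // total number of requests

bool IsBigStringRead(const std::string& key, const std::string& value) {
  read_size += value.size();
  ++total_req_amount;
  return value.size() > kBigStringSize;  // flag `key` in metrics/logs if true
}

bool IsBigStringWrite(const std::string& key, const std::string& value) {
  write_size += value.size();
  ++total_req_amount;
  return value.size() > kBigStringSize;
}

bool IsBigCollectionRead(const std::string& key, size_t element_count) {
  ++total_req_amount;
  return element_count > kBigCollectionElems;  // hash/zset judged by element count
}

// Average request size, the same idea as rate(read_size)/rate(total_req_amount)
// computed by the monitoring system over a time window.
double AverageRequestBytes() {
  uint64_t reqs = total_req_amount.load();
  return reqs ? double(read_size.load() + write_size.load()) / reqs : 0.0;
}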

3. RedKV Cluster

The company has many storage scenarios. The advertising business, for example, stores many tags and data models; it is a very core business and requires resource isolation. To reduce the impact of node failures and shrink the blast radius of the data, we use a non-centralized management architecture for this kind of business, i.e. the RedKV 1.0 architecture, which greatly simplifies deployment and operations. The centerless cluster uses the Gossip protocol, and storage nodes are deployed as multi-process, multi-instance, as shown in Figure 7.

Figure 7 KV Cluster controlled by Gossip 

Recommendation-model training involves very large data volumes, many upstream and downstream services, high QPS, and therefore many cluster nodes, which makes Gossip prone to jitter during fault handling and scaling. For node management of such large clusters, we adopted a centralized management architecture, i.e. the RedKV 2.0 architecture. The shard-based central architecture supports data migration and cluster scaling better; storage nodes use single-process, multi-instance deployment and can elastically scale the number of replicas in multi-active scenarios, as shown in Figure 8. The components of RedKV 2.0 will be introduced in detail in subsequent articles.

Figure 8 KV Cluster based on central management and control 

1) Gossip optimization

RedKV 1.0 uses the Gossip protocol for communication. When a node fails, the master-slave switch can take up to 30s to complete. After a failure, it takes a convergence period before the healthy nodes in the cluster mark the failed node as FAIL; during this period the Proxy layer may still forward requests to the failed node, and those requests fail. Shortening the convergence time effectively reduces the number of failed requests at the Proxy layer and improves cluster stability and availability.

RedKV1.0 speeds up view convergence through the following three steps:

  • Detection time optimization

Normally the Redis Gossip protocol randomly selects one node every 100ms to ping, updating that node's ping_sent to the send time. In a large cluster with many nodes, the probability of pinging the failed node is small, and in the worst case it takes up to node_timeout/2 before a ping is sent to it. As a result, when a node fails, the healthy nodes cannot ping it promptly and therefore cannot detect the failure promptly. To shorten this time, when a node has not received a pong from a suspect node for more than 2 seconds, it immediately notifies the other nodes to ping that node. This keeps the time from node failure to the first ping from a healthy node at about 2 seconds.

  • Determine PFAIL time optimization

In the current Gossip implementation, a node is marked PFAIL if no pong is received within node_timeout (usually 15s). This optimization lowers the threshold to 3s (configurable): the first time within 24 hours (configurable) that no pong is received for more than 3s, the node is marked PFAIL. If this happens frequently within 24 hours, it is likely network jitter, and the original path of waiting for node_timeout is used instead (a sketch of this decision follows the three steps).

  • Reduce the decision time from PFAIL to FAIL

A node marks a suspect node as FAIL only after it has received PFAIL reports from more than half of the nodes in the cluster. PFAIL information is exchanged through Gossip, so propagating it can take up to node_timeout/2. To accelerate the transition from PFAIL to FAIL, all nodes select a seed node according to a common rule; in addition to the randomly chosen node, PFAIL messages are also sent to the seed node. The seed node therefore learns the PFAIL state of the whole cluster as early as possible, marks the failed node as FAIL, and broadcasts it to the cluster.
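The following minimal sketch illustrates the second step above, the 3s PFAIL decision with a 24-hour jitter guard; the names and structure are assumptions, not the actual RedKV Gossip code.

#include <chrono>
#include <deque>

using Clock = std::chrono::steady_clock;

struct PfailPolicy {
  std::chrono::seconds fast_timeout{3};      // optimized threshold (configurable)
  std::chrono::seconds node_timeout{15};     // original Redis-style timeout
  std::chrono::hours window{24};             // jitter-observation window (configurable)
  std::deque<Clock::time_point> fast_trips;  // times the 3s rule fired

  // Returns true if the peer should be marked PFAIL now.
  bool ShouldMarkPfail(Clock::time_point last_pong, Clock::time_point now) {
    auto silent = now - last_pong;
    // Drop fast-trip records that fell out of the 24h window.
    while (!fast_trips.empty() && now - fast_trips.front() > window)
      fast_trips.pop_front();

    if (silent > fast_timeout && fast_trips.empty()) {
      // First 3s silence within 24h: trust it and mark PFAIL immediately.
      fast_trips.push_back(now);
      return true;
    }
    // Repeated short silences look like network jitter: fall back to the
    // original node_timeout rule.
    return silent > node_timeout;
  }
};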

2) RedKV Server

RedKV Server configures multiple I/O threads that listen on the same port to accept connections, so connections are roughly balanced across threads. Each I/O thread parses only the requests on its own connections and dispatches each parsed request, by key, to the corresponding request queue; each queue is processed by one Worker thread. Requests for the same key/slot are therefore handled on the same Worker thread, which avoids locking on keys and reduces lock contention and thread switching. The Worker thread re-encodes the data and stores it in the RocksDB local storage engine.

The internal thread model of RedKV is as shown in Figure 9:

Figure 9 RedKV Server lock-free thread model 
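A simplified sketch of the key-to-worker dispatch described above (the queue type, slot count, and hash function are assumptions, not RedKV source):

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

constexpr uint32_t kSlotCount = 16384;  // Redis-Cluster-style slot space, assumed

struct Request { std::string key; std::string payload; };

// In the real server each queue would be consumed by one dedicated Worker
// thread; a plain vector stands in for it here.
using WorkerQueue = std::vector<Request>;

uint32_t KeyToSlot(const std::string& key) {
  // Any stable hash works for the sketch; RedKV's actual slot hash may differ.
  return std::hash<std::string>{}(key) % kSlotCount;
}

void Dispatch(Request req, std::vector<WorkerQueue>& workers) {
  // Same key -> same slot -> same worker, so no per-key locking is needed.
  uint32_t slot = KeyToSlot(req.key);
  size_t worker_idx = slot % workers.size();
  workers[worker_idx].push_back(std::move(req));
}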

3) Data storage

RedKV currently supports string, hash, and zset data types. Data nodes use RocksDB as the local storage engine. At cluster-creation time, the number of replicas is configurable and master and slave nodes are deployed on separate hosts. Keys are hashed into contiguous slot shards, which helps avoid hot-key problems. Each data type is encoded as (MetaKey, MetaValue) and (DataKey, DataValue) pairs; the formats are as follows:

  • MetaKey
  • MetaValue
  • DataKey
  • DataValue

With this encoding, the slot information embedded in the key allows data to be flexibly migrated slot by slot during scaling.
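The exact key layouts were shown as figures in the original article and are not reproduced here. The following is only a plausible sketch, under assumed field order and widths, of why a leading slot prefix makes slot-by-slot migration a simple range scan:

#include <cstdint>
#include <string>

std::string EncodeDataKey(uint16_t slot, const std::string& user_key,
                          const std::string& field) {
  std::string out;
  out.push_back(static_cast<char>(slot >> 8));    // slot prefix first, so that
  out.push_back(static_cast<char>(slot & 0xff));  // one slot = one contiguous range
  out.append(user_key);
  out.push_back('\x00');                          // separator between key and field
  out.append(field);                              // e.g. a hash field name
  return out;
}

// Migrating a slot then becomes a bounded range scan in RocksDB:
// iterate from EncodeDataKey(slot, "", "") up to EncodeDataKey(slot + 1, "", "").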

4. Ecosystem features

1. Data replication

Unlike traditional solutions that introduce a separate synchronization component, we implemented one-way data synchronization and cluster expansion quickly by extending the Redis replication protocol, so RedKV data nodes replicate to each other directly and the architecture has no third-party dependency, as shown in Figure 10. The limitation of one-way replication is that expansion requires doubling the nodes (n to 2n) and synchronizing node to node; after expansion completes, a background task deletes the data that no longer belongs to each node according to the key sharding described in section 3.3.3.

In multi-active deployment mode, using one-way replication for one-to-many replication across multi-cloud clusters is too intrusive to the primary cluster's performance. We therefore implemented a replication strategy based on central management. It supports sharded, heterogeneous deployment across multiple clusters, synchronizes data directionally through checkpoints, no longer needs background tasks to purge data, and supports many-to-many multi-cloud cluster replication, data destruction, and scaling well.

Figure 10 RedKV data replication 

2. Data batch import

A large amount of Xiaohongshu's offline business data is stored on S3 and in Hive. Some of it must be incrementally updated every day, while other data is aged out. This kind of scenario poses several challenges:

1) Batch import

For example, Xiaohongshu's note data generally needs to be refreshed hourly or even daily, so the business needs a fast batch-import capability.

2) Quick update

Feature data is characterized by extremely large volume. Taking notes as an example, the full note dataset is at the TB level; if it were written through the Jedis SDK, the storage cluster would need machine resources supporting millions of QPS. Currently the Xiaohongshu data platform lets businesses import data from Hive to RedKV directly through a workflow, typically starting the write in the early morning and reading heavily during the evening peak. In practice this approach often caused the RedKV cluster's memory to OOM, affecting stability.

3) Performance and stability

Read performance must not be affected while data is being imported.

The implementation plan is shown in Figure 11:

  • A custom UDTF obtains the cluster view and performs data encoding, supporting the RedKV 1.0 data format.
  • Data extraction, encoding, sharding, and sorting are integrated into one HiveOperator; when it finishes, it writes SST files to a specified S3 directory according to the specified OutputFormat.
  • The Hadoop distcp tool transfers the data across clouds; this offline bandwidth does not affect online read/write traffic.
  • A sidecar on each RedKV node acts as an object-storage client; the RedKV node downloads its own SST files and ingests them (a sketch of the ingest step follows Figure 11).

Figure 11 Offline data BulkLoad 
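A minimal sketch of the final ingest step on a data node, assuming the sidecar has already downloaded this node's SST files locally (paths and option values are illustrative). It uses RocksDB's IngestExternalFile, which installs SSTs directly into the LSM tree and bypasses the memtable/WAL write path, which is why the bulk load barely disturbs online reads:

#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <cassert>
#include <string>
#include <vector>

void IngestBulkLoadedSst(rocksdb::DB* db,
                         const std::vector<std::string>& local_sst_paths) {
  rocksdb::IngestExternalFileOptions opts;
  opts.move_files = true;  // link/move the SSTs instead of copying them
  rocksdb::Status s = db->IngestExternalFile(local_sst_paths, opts);
  assert(s.ok());
}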

3. Data batch export

Xiaohongshu's business model-training data is stored in RedKV clusters as hashes. Downstream businesses need to analyze the training results offline and want RedKV to interoperate with Hive. RedKV itself has no schema, so to import KV data into a Hive table, the hash KKV data must be converted into a table.

Data inside RedKV is scattered by hash. To import it into a Hive table, the table keyword must be provided: first the storage nodes are prefix-scanned, then files in a format Hive recognizes are generated, and finally they are loaded with Hive Load. For better compatibility with other Spark jobs, we chose Parquet, the standard columnar format supported by Hive. The whole I/O path is shown in Figure 12:

Figure 12 RedKV2Hive I/O

Example: keys written to RedKV start with {tablename}_, for instance for a person table.

Data in RedKV is written using hmset:

hmset {person}_1 name John quantity 20 price 200.23
hmset {person}_2 name Henry quantity 30 price 3000.45

With this writing convention, RedKV2Hive can be configured to import the KV data into the person table in Hive. If a single table is very large, writes can be split across sub-tables, for example splitting the person table into 16 parts:

hmset {person:1}_1 name John quantity 20 price 200.23
hmset {person:1}_2 name Henry quantity 30 price 3000.45
...
hmset {person:16}_100000 name Tom quantity 43 price 234.56
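As a hedged illustration of how a writer might choose the sub-table, the sketch below assumes a simple modulo rule on the record id; the article does not specify the actual split function, and the helper name is made up.

#include <cstdint>
#include <string>

std::string SplitTableKey(const std::string& table, int splits, uint64_t id) {
  int part = static_cast<int>(id % splits) + 1;  // 1..splits, e.g. 1..16
  return "{" + table + ":" + std::to_string(part) + "}_" + std::to_string(id);
}

// Example: SplitTableKey("person", 16, 33) -> "{person:2}_33".
// RedKV2Hive can then scan the 16 prefixes and merge them into one Hive table.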

4. Data backup and recovery

Xiaohongshu's advertising data is stored in the in-house distributed KV system. Data safety faces the following challenges:

  • In an LSM-based KV system, the space amplification caused by compaction can double the footprint; when the data volume is large, backups require large-capacity disks.
  • After a whole-cluster failure, the recovery time is uncontrollable.
  • Backup data depends on third-party systems.
  • The advertising system requires timely data recovery, usually at the minute level.

To solve these problems, we proposed a centrally managed primary/backup-cluster disaster-recovery strategy: through second-level switching at the Proxy access layer, service can quickly switch to a specific data version.

The implementation plan is shown in Figure 13:

  • A disaster-recovery cluster is deployed in advance. The primary cluster serves external reads and writes, while the DR cluster keeps a configured number of snapshot versions.
  • During off-peak hours, the central controller periodically issues snapshot tasks to the primary cluster according to the configured version number and schedule.
  • After a snapshot completes, its directory is transferred to the DR cluster with rsync. Because the primary cluster's data has been compacted during off-peak hours, the data volume is controlled and the DR cluster can retain the specified number of versions.
  • After a failure, the central controller can instruct the DR cluster via RPC to restore to a specific version.
  • The Proxy access layer registers both clusters through service registration and discovery; with a dynamic second-level switch, traffic can be directed to a specific cluster version, completing second-level switching of service access.

Figure 13 Cluster backup 

5. Cross-cloud multi-active deployment

To cope with rapid business growth, the company has ever-higher requirements for the stability of cloud vendors' services, and a single-datacenter cloud deployment can no longer meet its stability needs. Cross-cloud multi-active deployment improves service stability: with dual-write and dual-read, both the primary and the backup data center serve reads and writes, so no data-center resources are wasted and cross-region disaster recovery is achieved. We did a comparative analysis of the solutions commonly used in the industry.

After a thorough survey of other companies' architectures, we proposed the RedKV dual-active design, in which the Replicator is deployed as a sidecar service on the same machine, as shown in Figure 14. Its advantages are:

  • Same-machine deployment keeps network overhead low;
  • The sidecar service is minimally intrusive to the main service;
  • It is deployed independently and easy to upgrade.

The architecture is flexible and well suited to active-active setups for log-structured storage systems; both the Redis and graph-database multi-cloud solutions can be adapted to it. Specific components and practical scenarios will be covered in detail in subsequent articles.

Figure 14 Cross-cloud multi-active architecture 

5. Practical cases

Following the business scenarios described in Section 2, this section uses one typical business scenario to demonstrate the benefits RedKV brings as a NoSQL store.

In the early days, before the zprofile middle platform existed, zprofile user and note data were stored in HBase. To guarantee data safety and service stability, HBase was deployed as dual clusters, and writers and readers accessed data through the HBase Client API. The HBase user data was tens of TB, and under millions of QPS the P99 latency was already around 70ms; as QPS grew, latency kept rising, the storage cost of cluster expansion kept growing, and stability became increasingly hard to guarantee.

After RedKV 1.0 went online and was polished for more than half a year, it gradually started taking over the company's core businesses. The recommendation-platform architecture team also began building the zprofile middle-platform service to consolidate upstream and downstream businesses and provide a standard, unified read/write interface. For the storage solution, after many rounds of discussion between the platform architecture team and the storage team, RedKV was chosen as the underlying storage. It mainly serves two kinds of business parties: data producers and data consumers. The final zprofile middle-platform architecture is shown in Figure 15:

  • zprofile-write provides a unified data-write interface to upstream services and manages the metadata of users and notes. User data is written to the redkv-zprofile-user cluster; notes and other data are written to the redkv-zprofile-other cluster.
  • zprofile-service provides a unified data-consumption service to downstream callers. For offline services with relaxed latency requirements, RedKV's built-in one-way replication is used to provide data-scan services through two small offline clusters.

After the overall architecture migration was completed, serving the same business QPS with RedKV saved 36% in cost and improved P99 performance by about 5x.

