Large-Scale Feature Storage Practices at vivo

This article first appeared on the vivo Internet Technology WeChat official account.
Link: https://mp.weixin.qq.com/s/u1LrIBtY6wNVE9lzvKXWjA
Author: Three Decades

This article introduces the practice, evolution, and future plans of feature storage inside vivo, and is meant to attract more good ideas.

I. Requirements Analysis

AI technology is being applied ever more widely within vivo, and feature data plays a vital role in scenarios such as offline training and online inference. We needed to design a system that stores all kinds of feature data reliably and efficiently.

1. Characteristics of feature data

(1) Large Values

Feature data generally contains many fields, so the final Value written to the KV store is especially large, even when compressed.

(2) Large data volume, high concurrency, high throughput

Feature scenarios need to store large volumes of data, which in-memory KV stores (such as Redis Cluster) struggle to accommodate and which would be very expensive there. Whether the scenario is online or offline, the request concurrency is high and the Values are not small, so the throughput is naturally large.

(3) High read/write performance, low latency

Most feature scenarios require very low read and write latency that stays stable with little jitter.

(4) No range queries

Most scenarios involve only single-point random reads and writes.

(5) Scheduled bulk data loading

Much feature data, once freshly computed, lives on OLAP-oriented storage products and is recomputed on a regular schedule; we want a tool that can periodically sync this feature data into the online KV store.

(6) Ease of use

Business teams should not have to pay much of a learning cost when accessing the storage system.

2. Potential requirements

  • Extend to a general-purpose disk-based KV store, supporting the large-capacity storage needs of every scenario

    Our goal is the stars and the sea; we must not limit ourselves to satisfying the feature scenario.

  • Support other NoSQL/NewSQL databases and reuse resources

    Starting from business needs, we will face demand for a wide variety of NoSQL databases, such as graph databases, time-series databases, object storage, and so on. If every product were completely isolated, with no reuse of resources (code, platform capabilities, etc.), the maintenance cost would be enormous.

  • Maintainability

    First, the implementation language must not be too niche, otherwise recruiting becomes difficult, and it should preferably match our development technology stack.

    The architecture design must not rely too heavily on third-party service components, so as to reduce the complexity of operation and maintenance.

3. The storage system beneath the iceberg

Based on the above requirements, we decided to be compatible with the Redis protocol: users see only something that looks like a single-node Redis service, while behind it we have done a great deal of work to guarantee reliability.

II. Solution Selection

For solution selection, we follow a few basic principles:

  • Start from open source and customize on demand.

  • Maximize internal benefit; pool ideas broadly.

  • Mainstream language, mainstream architecture.

  • Reliability first, high maintainability.

We briefly analyzed the pros and cons of several candidate solutions during our early research.

Frankly, the projects we researched are all good open source projects, but with only the official code and design documents and no in-depth hands-on experience, it is hard for us to determine whether an open source product really suits us. A properly calibrated bake-off among candidates might have been better, but the way we chose also reflects, to some extent, our strong execution.

Overall, we looked for the best balance across existing requirements, potential requirements, ease of use, soundness of architecture, performance, and maintainability. After a period of theoretical research and hands-on practice, we finally chose Nebula.

III. Nebula Overview

Nebula Graph is a high-performance, highly available, highly reliable, strongly consistent, open source distributed graph database.

1. Storage-compute separation

Nebula adopts a storage-compute separated design: the stateless compute service and the stateful storage service are layered, so the storage layer can focus on data reliability and simply expose a KV interface, while the compute layer can focus on the computation logic users need. This also greatly improves the flexibility of deployment and operations.

However, as a graph database, Nebula pushes part of the graph computation logic down into the storage layer to improve performance; this is a pragmatic trade-off between flexibility and performance.

2. Mainstream strongly consistent architecture

Nebula relies on Raft for strong consistency, the mainstream approach to multi-replica consistency, and its Raft implementation has already passed initial Jepsen linearizability tests. For an open source project that started fairly recently, this helps build user confidence.

3. Scalability

Nebula's ability to scale out comes from its Hash-based Multi-Raft implementation, combined with a built-in scheduler (Balancer) for load balancing. The architecture and implementation are relatively simple (at least for now), so the cost of adoption is low.

4. Maintainability

The Nebula kernel is implemented in C++, which matches our development technology stack well. After evaluation, Nebula's basic platform capabilities (such as monitoring interfaces and its deployment model) are simple and easy to use, and integrate well with our own platforms.

The code is well abstracted and flexibly supports multiple storage engines, which later laid a good foundation for our performance optimizations targeting feature scenarios.

IV. Nebula Raft Overview

As mentioned above, Nebula relies on Raft to guarantee strong consistency. Here is a brief outline of the characteristics of Nebula's Raft:

1. Leader election and terms

The lifetime of a Raft group is a series of consecutive terms. At the start of each term a Leader is elected and the remaining members are Followers; each term has only one Leader, and if the Leader becomes unavailable during its term, the group immediately moves to the next term and elects a new one. This strong-Leader mechanism makes Raft far easier to implement in engineering terms than its ancestor, Paxos.

2. Log replication and compaction

In a standard Raft implementation, each write request from a client is converted into an operation log and appended to the WAL file. The Leader applies the operation to its own state machine and proactively replicates the log to all Followers asynchronously; only after more than half of the Followers have acknowledged does it return a successful write to the client.

In actual operation, WAL files keep growing; without a reasonable WAL reclamation mechanism they will soon occupy the entire disk. That mechanism is log compaction. Nebula's log compaction implementation is fairly simple: the user only needs to configure a single wal_ttl parameter, and the disk space occupied by WAL files can be kept within a stable range without compromising cluster correctness.
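
To make the idea concrete, here is a rough illustration (not Nebula's actual code) of what wal_ttl-style reclamation amounts to: a WAL segment can be dropped once it is older than the configured TTL and its entries are already reflected in the state machine, so disk usage stays bounded without hurting correctness. The WalSegment structure and the applied-index check are assumptions made for this sketch.

```cpp
// Illustration of wal_ttl-style WAL reclamation (WalSegment layout and the
// applied-index check are invented for this sketch; Nebula's real logic differs).
#include <cstdint>
#include <vector>

struct WalSegment {
  uint64_t last_log_index;    // highest log index contained in this segment
  int64_t last_modified_sec;  // last modification time of the segment file
};

std::vector<WalSegment> ReclaimExpiredWals(const std::vector<WalSegment>& segments,
                                           int64_t now_sec, int64_t wal_ttl_sec,
                                           uint64_t applied_log_index) {
  std::vector<WalSegment> kept;
  for (const auto& seg : segments) {
    bool expired = now_sec - seg.last_modified_sec > wal_ttl_sec;
    bool applied = seg.last_log_index <= applied_log_index;  // only drop entries already applied
    if (!(expired && applied)) {
      kept.push_back(seg);  // everything else must be retained
    }
  }
  return kept;  // segments that survive reclamation
}
```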

Nebula's Raft implements batching and pipelining mechanisms, supporting batched and out-of-order log replication from the Leader to Followers, which effectively improves overall cluster throughput under high concurrency.

3. Membership change

The implementation is similar to typical Raft; here we focus on the Snapshot mechanism in Nebula's Raft.

When a new member joins a Raft group, it needs to obtain all the logs from the current Leader and replay them into its own state machine, which is a non-negligible resource overhead and puts considerable pressure on the Leader. For this reason Raft generally provides a Snapshot mechanism, addressing both the performance problem of adding nodes and the timeliness problem of recovering failed nodes.

A Snapshot is a "mirror image" of the Leader's state machine stored separately; in Nebula's Raft implementation this "image" is simply the RocksDB instance (i.e., the state machine itself). When a new member joins, the Leader scans the whole RocksDB instance by calling its Iterator, sends the key-value pairs it reads to the new member in batches, and eventually completes the entire Snapshot copy.
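
The following is a minimal sketch of that snapshot streaming idea: iterate over the Leader's RocksDB state machine and ship key-value pairs to the joining member in fixed-size batches. The RocksDB iterator calls are real API; SendBatchToPeer is a hypothetical RPC stand-in, and Nebula's actual implementation differs in detail.

```cpp
// Sketch of snapshot streaming: scan the state machine with a RocksDB iterator
// and send key-value pairs to the new member batch by batch.
// SendBatchToPeer is a hypothetical RPC, not Nebula's real interface.
#include <rocksdb/db.h>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using KvBatch = std::vector<std::pair<std::string, std::string>>;

void StreamSnapshot(rocksdb::DB* db,
                    const std::function<void(const KvBatch&)>& SendBatchToPeer,
                    size_t batch_size = 1024) {
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  KvBatch batch;
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    batch.emplace_back(it->key().ToString(), it->value().ToString());
    if (batch.size() >= batch_size) {
      SendBatchToPeer(batch);  // replicate one batch to the joining member
      batch.clear();
    }
  }
  if (!batch.empty()) {
    SendBatchToPeer(batch);  // flush the final partial batch
  }
}
```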

4. Multi-Raft implementation

If a cluster has only one Raft group, it is hard to scale out by adding machines and the applicable scenarios are very limited. The natural idea is to shard the cluster's data into multiple Raft groups, which introduces two new problems: (1) how to shard the data, and (2) how to distribute the shards evenly across the cluster.

Implementing Multi-Raft is a challenging and very interesting undertaking. There are two mainstream approaches in the industry, Hash-based and Region-based, each with its pros and cons; in most cases the former is simpler and effective. Nebula currently uses the Hash-based model, which is exactly what we need, but for graph scenarios there is no further public plan yet, so the community's progress deserves continuous attention.
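
As a simplified illustration of the Hash-based model (not Nebula's exact placement algorithm), a key is hashed to one of a fixed number of partitions, and each partition's replicas are spread across hosts so that partitions, and ideally Leaders, end up roughly balanced:

```cpp
// Illustrative hash-based sharding; the hash function and round-robin placement
// are simplifications, not Nebula's real algorithm.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

int32_t PartitionOf(const std::string& key, int32_t num_parts) {
  // Any stable hash works; std::hash is used here only for illustration.
  return static_cast<int32_t>(std::hash<std::string>{}(key) % num_parts);
}

std::vector<int32_t> ReplicaHosts(int32_t part_id, int32_t num_hosts, int32_t replica_factor) {
  // Place a partition's replicas on consecutive hosts so that partitions
  // (and their Leaders) spread roughly evenly across the cluster.
  std::vector<int32_t> hosts;
  for (int32_t i = 0; i < replica_factor; ++i) {
    hosts.push_back((part_id + i) % num_hosts);
  }
  return hosts;
}
```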

V. The Feature Storage Platform

1. System Architecture

On top of Nebula's original architecture, we added several components, including the Redis proxy, the RedisCluster proxy, and platform-related components.

The Meta instances store the meta information of the whole cluster, including the routing rules for data sharding, space information, and so on; the Meta service is itself a Raft group.

The Storage instances are the nodes that store the actual data. Suppose a cluster's data is divided into m shards, each shard corresponding to one Raft group with n replicas; Nebula distributes the m*n replicas evenly across the Storage instances and tries to keep the number of Leaders on each instance roughly equal as well.

The Graph instances provide the graph API service for the whole cluster, as well as the Console; they are stateless.

The Redis instances are compatible with the Redis protocol, implement part of the native Redis data structures, and are stateless.

The RedisCluster instances are compatible with the Redis Cluster protocol and are stateless.

2. Performance Optimization

(1) Cluster Tuning

When onboarding real production workloads, we often need to tune parameters for different scenarios. Early on this took up a lot of time, but it also accumulated valuable experience for us.

(2) WiscKey

As mentioned earlier, the Values in most feature scenarios are relatively large. Relying solely on RocksDB leads to severe write amplification, because Compaction is triggered frequently and every Compaction rewrites both Keys and Values to disk; with large Values this overhead is frightening. Academia has proposed solutions to this, among which WiscKey is widely recognized for its practicality, and the industry has landed an open source implementation of it (Titandb).

For details on Titandb, refer to its official documentation. In short, it modifies RocksDB while staying compatible with its external interface: the LSM-tree is retained and a new BlobFile store is added, so that Keys and Values are stored separately, with Keys in the LSM-tree and Values in BlobFiles. It relies on the good random read/write performance of SSDs and sacrifices range-query performance to reduce write amplification in large-Value scenarios.
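
To show the key-value separation idea itself (this is a toy model, not Titandb's real API or on-disk format), small values stay inline in the LSM-tree while large values are appended to a blob log and only a pointer is indexed, so compactions rewrite pointers rather than large values:

```cpp
// Toy model of WiscKey-style key-value separation. The containers below stand in
// for the LSM-tree and the BlobFile; this is not Titandb's actual interface.
#include <cstdint>
#include <map>
#include <string>

struct BlobPointer { uint64_t offset; uint32_t length; };

class KvSeparatedStore {
 public:
  explicit KvSeparatedStore(size_t min_blob_size) : min_blob_size_(min_blob_size) {}

  void Put(const std::string& key, const std::string& value) {
    if (value.size() < min_blob_size_) {
      inline_index_[key] = value;  // small value: keep it inline in the "LSM-tree"
      return;
    }
    BlobPointer ptr{blob_log_.size(), static_cast<uint32_t>(value.size())};
    blob_log_.append(value);       // large value: append once to the "BlobFile"
    blob_index_[key] = ptr;        // the "LSM-tree" only indexes the pointer
  }

  bool Get(const std::string& key, std::string* out) const {
    auto it = inline_index_.find(key);
    if (it != inline_index_.end()) { *out = it->second; return true; }
    auto bit = blob_index_.find(key);
    if (bit != blob_index_.end()) {
      *out = blob_log_.substr(bit->second.offset, bit->second.length);  // one extra random read
      return true;
    }
    return false;
  }

 private:
  size_t min_blob_size_;
  std::map<std::string, std::string> inline_index_;  // stand-in for the LSM-tree (small values)
  std::map<std::string, BlobPointer> blob_index_;    // stand-in for key -> blob pointer entries
  std::string blob_log_;                             // stand-in for an append-only BlobFile
};
```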

Thanks to Nebula's design of supporting multiple storage engines, Titandb was very easy to integrate into Nebula Storage, and in actual production it indeed brought us a good performance gain.

3. TTL mechanism

Both RocksDB and Titandb support the Compaction Filter interface: during Compaction this filter is invoked to decide whether a given piece of data should be filtered out. We append a TTL to each Value actually written into Storage; when the Compaction Filter scans a Value, it extracts the TTL and checks whether it has expired, and if so, the corresponding Key-Value pair is deleted.
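
A minimal sketch of such a filter is shown below. It assumes the writer appended an 8-byte absolute expiry timestamp (in seconds) at the tail of every value, which is an illustrative encoding rather than our production format; the CompactionFilter interface itself is RocksDB's real API.

```cpp
// TTL compaction filter sketch: drop a key-value pair if its embedded expiry
// timestamp has passed. The value encoding (8-byte expiry suffix) is assumed
// for illustration only.
#include <rocksdb/compaction_filter.h>
#include <rocksdb/slice.h>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <string>

class TtlCompactionFilter : public rocksdb::CompactionFilter {
 public:
  const char* Name() const override { return "TtlCompactionFilter"; }

  // Returning true tells the compaction to discard this key-value pair.
  bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
              const rocksdb::Slice& existing_value,
              std::string* /*new_value*/, bool* /*value_changed*/) const override {
    if (existing_value.size() < sizeof(uint64_t)) {
      return false;  // no TTL suffix, keep the value
    }
    uint64_t expire_at = 0;
    std::memcpy(&expire_at,
                existing_value.data() + existing_value.size() - sizeof(uint64_t),
                sizeof(uint64_t));
    return expire_at != 0 && expire_at <= static_cast<uint64_t>(std::time(nullptr));
  }
};
```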

In practice, however, we found that once Titandb has separated a large Value into a BlobFile, the Compaction Filter can no longer read the concrete Value (only small Values still kept in the LSM-tree can be read). This badly undermined our TTL mechanism, leaving expired data with no way to be reclaimed. We therefore did a little special handling: when a large Value is separated into a BlobFile, the LSM-tree keeps a Key-Index pair, where the Index records the Value's position in the BlobFile. We additionally plant the TTL into this Index, so that the filter can still parse the TTL at Compaction time and physically delete all expired data.

4. Ease of use

Ease of use is a sign of a mature database product, and it is a big topic in its own right.

Different user perspectives produce different sets of requirements; the user roles include DBAs, business R&D engineers, operations engineers, and so on. Ultimately we hope to exceed expectations from every perspective and build a genuinely easy-to-use storage product. Here are some simple ease-of-use practices we have adopted:

(1) Redis protocol compatibility

We adapted Meitu's open source KVrocks (a single-node disk-based KV product built on RocksDB and compatible with the Redis protocol): relying on Nebula's C++ Storage Client, we replaced its underlying RocksDB read/write logic with calls to Nebula Storage's KV interface, obtaining a stateless Redis-protocol compatibility layer (Proxy), and we implemented some additional commands based on actual needs. Of course, for now we have only implemented the Redis commands needed by feature scenarios; supporting all Redis commands on top of a distributed KV would require considering distributed transactions, which I will keep under wraps for now, so stay tuned.
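
The essence of the proxy is small: parse Redis commands as usual, but route reads and writes to the distributed KV instead of a local RocksDB. Below is a hedged sketch of that mapping; KvClient, HandleSet, and HandleGet are hypothetical stand-ins, not the actual KVrocks handlers or the Nebula Storage Client interface.

```cpp
// Sketch of the Redis-compatibility proxy: SET/GET are translated into calls on
// a distributed KV client. KvClient is a hypothetical interface used throughout
// these sketches; the real proxy uses the Nebula C++ Storage Client.
#include <optional>
#include <string>

class KvClient {
 public:
  virtual ~KvClient() = default;
  virtual bool Put(const std::string& key, const std::string& value) = 0;
  virtual std::optional<std::string> Get(const std::string& key) = 0;
};

// Handler the command dispatcher would call for SET: forward to the KV store.
std::string HandleSet(KvClient& kv, const std::string& key, const std::string& value) {
  return kv.Put(key, value) ? "+OK\r\n" : "-ERR storage write failed\r\n";
}

// Handler for GET: fetch from the KV store and encode a RESP bulk-string reply.
std::string HandleGet(KvClient& kv, const std::string& key) {
  auto value = kv.Get(key);
  if (!value) {
    return "$-1\r\n";  // RESP nil reply when the key does not exist
  }
  return "$" + std::to_string(value->size()) + "\r\n" + *value + "\r\n";
}
```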

(2) Bulk import of data from Hive into the KV store

For feature scenarios, this capability is itself part of ease of use. Nebula has already implemented importing graph-structured data from Hive; a small modification makes it compatible with the KV format.

(3) Operations platform

We maintain a common configuration center that holds the meta information of all online clusters, and we have implemented a few simple operational capabilities, such as one-click cluster deployment, one-click cluster removal, periodic monitoring reports, periodic command correctness checks, periodic instance health checks, periodic cluster load monitoring, and so on, which meet basic day-to-day operational needs. Meanwhile, vivo is building a fully featured DBaaS platform internally that already supports the operation of many DB products in production, including Redis, MySQL, Elasticsearch, MongoDB, and so on, greatly improving the efficiency of data management operations. The feature storage platform will eventually be fully integrated with it and evolve together with it, continuing to make breakthroughs in ease of use and robustness.

5. Disaster Recovery

(1) Scheduled cold backups

Nebula itself provides a cold backup mechanism; we only need to design a scheduled backup strategy tailored to each business to meet its needs. We will not go into detail here; if you are interested, take a look at Nebula's cluster snapshot mechanism.

(2) Real-time hot standby

Hot standby was rolled out in two phases:

Phase I: relatively simple; only incremental backup is considered, and some data loss is tolerated.

At present the KV service mainly serves feature scenarios (or cache scenarios), where the data reliability requirements are not especially high and data does not stay in the store for long before the TTL clears it away. For this reason, the hot-standby scheme does not back up existing (stock) data.

Incremental backup means that the Proxy layer asynchronously writes each "write request" once more to the standby cluster, while the primary cluster continues to be written synchronously. As long as the Proxy has sufficient CPU resources, this does not affect the read/write performance of the primary cluster itself. There is a risk of data loss here: if the Proxy process dies before its asynchronous writes finish, the standby cluster will lose a little data; but as mentioned before, most feature (or cache) scenarios can tolerate this degree of data loss.
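
A sketch of that phase-I write path follows, reusing the hypothetical KvClient interface from the proxy sketch above: the primary write is synchronous, and the standby write is fired off in the background and may be lost if the Proxy dies mid-flight, which is exactly the tolerated failure mode described here. A real implementation would queue the replays instead of spawning a thread per write.

```cpp
// Phase-I incremental backup sketch: synchronous write to the primary cluster,
// best-effort asynchronous replay to the standby cluster. Assumes both clients
// outlive the detached task; error handling and batching are elided.
#include <string>
#include <thread>

bool WriteWithIncrementalBackup(KvClient& primary, KvClient& standby,
                                const std::string& key, const std::string& value) {
  bool ok = primary.Put(key, value);  // the client's latency depends only on this write
  if (ok) {
    // Fire-and-forget replay; if the Proxy crashes before it runs, the standby
    // cluster silently misses this write (accepted in phase I).
    std::thread([&standby, key, value] { standby.Put(key, value); }).detach();
  }
  return ok;
}
```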

Phase II: guarantee both incremental backup and backup of existing (stock) data.

Nebula's Raft introduces the Learner role: a Learner is also a replica in the Raft group, but it neither participates in leader election nor counts toward the majority for commits; it simply receives the log replication from the Leader quietly. As with other Followers, if a Learner goes down, the Leader keeps retrying log replication to it until the Learner restarts and recovers.

With this mechanism, implementing stock-data backup becomes fairly simple: we implement a disaster recovery component that disguises itself as a Learner and joins the Raft group. The membership-change mechanism then ensures that the Leader's stock data and incremental data reach the component in the form of Raft logs, and on the other side the component relies on the Nebula Storage Client to convert the log data back into write requests and apply them to the disaster recovery cluster.
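
Conceptually, the DR component is just a loop that receives replicated log entries and re-issues them as writes against the disaster recovery cluster. The sketch below reuses the hypothetical KvClient from earlier; LogEntry, the payload encoding, and DecodePut are invented for illustration and do not reflect Nebula's real log format or Storage Client API.

```cpp
// Learner-style DR component sketch: replay received Raft log entries onto the
// disaster-recovery cluster, tracking the last applied index so replay can resume.
// LogEntry, the "key\nvalue" payload encoding, and DecodePut are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry { uint64_t term; uint64_t index; std::string payload; };
struct PutOp { std::string key; std::string value; };

// Hypothetical payload decoder: assume the payload is "key\nvalue".
PutOp DecodePut(const std::string& payload) {
  auto pos = payload.find('\n');
  if (pos == std::string::npos) return {payload, ""};
  return {payload.substr(0, pos), payload.substr(pos + 1)};
}

void ReplayToDrCluster(KvClient& dr_cluster, const std::vector<LogEntry>& entries,
                       uint64_t* applied_index) {
  for (const auto& entry : entries) {
    if (entry.index <= *applied_index) continue;  // skip entries already applied
    PutOp op = DecodePut(entry.payload);
    dr_cluster.Put(op.key, op.value);             // re-issue the write downstream
    *applied_index = entry.index;                 // remember progress for resumption
  }
}
```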

6. Cross-datacenter active-active

The active-active rollout is likewise divided into two phases:

Phase I: no conflict resolution; eventual consistency between the two clusters is not guaranteed.

This version is likewise simple to implement. It can be understood as two clusters serving as each other's disaster recovery, which helps businesses that need same-city active-active and failover but have relatively weak requirements on eventual consistency.

Phase II: introduce CRDTs to handle conflicts and achieve eventual consistency.

This version targets higher reliability requirements. It builds on the phase-II disaster recovery capability, obtaining each cluster's write requests from the logs received by the Learner.

Normally the two KV clusters are deployed in different datacenters; business units in each datacenter read data from the KV in their own datacenter, and the two KV clusters synchronize changes with each other. If both KV clusters update the same Key and then synchronize to each other, how should the conflict be handled?

The most straightforward approach is to let the "latest" write win and update both KVs with it, which guarantees eventual consistency. "Latest" here is not meant in an absolute sense; it is decided by the timestamp attached to each write operation, and two writes to the same Key carry the timestamps of their respective datacenters. Clocks are not necessarily synchronized between datacenters, so the operation that actually happened first may carry the larger timestamp; but our goal is consistency, not a fight with clock-synchronization mechanisms, so this is acceptable. For this idea, the well-known eventual consistency scheme CRDT already provides a standard implementation.

The actual KV stores only the String data type, which corresponds to the Register data structure in CRDT. One of its implementations is the Op-based LWW (Last-Write-Wins) Register: by definition, the "latest" written Value becomes the eventually consistent state.
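
The core of an LWW Register is a merge rule that every replica applies identically, so both datacenters converge no matter in which order they see the writes. The sketch below is my illustration of that rule, not vivo's internal implementation: the larger timestamp wins, and ties are broken deterministically by the originating datacenter id.

```cpp
// LWW (Last-Write-Wins) Register sketch: each write carries a timestamp and an
// origin id; on conflict the later timestamp wins, with the origin id as a
// deterministic tie-breaker so replicas converge. Illustrative only.
#include <cstdint>
#include <string>

struct LwwRegister {
  std::string value;
  uint64_t timestamp = 0;  // timestamp attached at the originating datacenter
  uint32_t origin = 0;     // datacenter / replica id, used only to break ties

  // Apply a local or remote write; returns true if the register changed.
  bool Apply(const std::string& new_value, uint64_t ts, uint32_t from) {
    if (ts < timestamp || (ts == timestamp && from <= origin)) {
      return false;  // older (or tied, lower-id) writes never override
    }
    value = new_value;
    timestamp = ts;
    origin = from;
    return true;
  }
};
```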

If you are interested in CRDTs, there is plenty of material available online; we will not go into further detail here.

Fortunately, a CRDT Register on top of Redis Cluster had already been implemented inside vivo, together with a reliable cross-datacenter data transport component, so the new KV store could stand on the shoulders of giants. Note that the online KV receives large mset write requests, while the CRDT Register only handles conflicts for single Set requests, so the active-active Learner component needs to break each Batch Write received from the Leader into individual Set commands before synchronizing them to the peer cluster.

VI. Future Outlook

1. Extend to a general-purpose KV store

When we started the feature storage project, we set the goal of a general-purpose KV store that could serve as the foundation for more powerful databases. But building a general-purpose KV store still requires a lot of work, including platform capabilities, better reliability, and lower cost. Fortunately, the industry already has many good practices of great reference value to us.

2. Continue improving platform capabilities

At the simplest level, drawing on the Redis platform management practices inside vivo and at major Internet companies, there is still a lot to do in building out the platform capabilities of the new KV store; combined with intelligent DB operations later on, the room for imagination is even greater.

3. Continue improving correctness verification

Data reliability and correctness are the foundation on which a database product lives and thrives, and we need to keep improving the corresponding verification mechanisms.

At this stage we do not promise financial-grade data reliability; we will keep moving in that direction, but for now meeting the needs of feature and cache scenarios is feasible.

We are gradually introducing some open source chaos-engineering tools, hoping to keep uncovering potential problems in the system and provide users with a more reliable data storage service.

4. Stronger scheduling capabilities

The core of a distributed database is built around three topics: storage, compute, and scheduling, which shows how important scheduling is; load balancing is one part of it. Could the current Hash-based sharding rule later be changed to Region-based sharding? Could we combine with Kubernetes to build a cloud-native KV storage product? Could data placement adjustments become smarter and more automated? We will see.

5. Hot/cold data separation

In essence this is a trade-off between cost and performance. Especially in some large clusters, 90% of the data may be rarely accessed; keeping all of it on flash wastes resources. On the one hand we want frequently accessed data to get better read/write performance; on the other hand we want to save as much cost as possible.

A fairly direct approach is to keep hot data on flash and freeze some of the cold data onto cheaper media (such as mechanical disks). This requires the system itself to have the ability to judge, continuously and dynamically, which data is hot and which is cold.

6. Support more storage engine types

We currently support RocksDB and Titandb; later we will consider introducing more storage engine types, such as a pure in-memory engine, or storage engines based on new hardware such as AEP.

7. Support remote cold backup to HDFS

For online scenarios, data backup is very important. Nebula currently supports cluster-level snapshot backups on local disks, but if machines die there is still a risk of losing a large amount of data, so we will consider backing data up to remote cold storage such as HDFS. Is it as simple as mounting an HDFS directory and dumping the cluster snapshot into the specified directory? We will think it through and design it further.

8. SPDK-based disk I/O

Our tests show that on the same NVMe disks, single-node throughput with SPDK is nearly double that without it. SPDK's kernel-bypass approach has become a clear trend; in scenarios where disk I/O easily becomes the bottleneck, SPDK can effectively improve resource utilization.

9. KV SSDs

Given the kernel-bypass advantages of SPDK, the industry has proposed a new solution: the KV SSD.

RocksDB is an LSM-tree implementation, and its Compaction mechanism causes severe write amplification, whereas a KV SSD provides a native KV interface compatible with the RocksDB API: newly written records can be written directly to the SSD without repeated Compaction, thereby eliminating RocksDB's write amplification. It is a new technology well worth trying.

10. Support graph databases

One important reason our KV product chose Nebula was to prepare for a graph database. We have been trying to onboard some businesses with graph requirements, and we hope to work with the open source community to build leading graph database capabilities.

11. Support time-series databases

In the era of 5G and IoT, time-series databases play a very important role.

InfluxDB currently leads this field, but its open source version does not support distributed deployment; it relies only on a storage engine designed for single-node time-series data (TSM), which greatly limits its practical value.

Our KV product already provides ready-made distributed replication, standardized platform capabilities, and high-availability safeguards, and we hope to reuse as much of this as possible.

Could we consider integrating TSM with our distributed replication capability, plus a sharding strategy friendly to time-series scenarios, to build a highly available distributed time-series storage engine that replaces the single-node storage layer of open source InfluxDB?

12. Support metadata storage for object storage

Metadata storage is critical to object storage. Since we already provide a strong KV storage product, can it be reused here to lighten the maintenance burden on both operations and development?

VII. Final Thoughts

In our continued practice we need to coordinate resources, gather requirements, and iterate on the product, striving to onboard more scenarios, collect more requirements, and polish our product further, so as to enter a virtuous cycle as soon as possible. These are our reflections and lessons so far.

For more content, follow the vivo Internet Technology WeChat official account.

Note: To reprint this article, please contact WeChat ID: labs2020.
