How to build a scalable and highly available key-value storage system


A key-value store, also known as a key-value database, is a type of non-relational database. Each unique identifier is stored as a key, along with its associated value. This pairing of data is called a "key-value" pair.

In a key-value pair, the key must be unique, and the value associated with the key can be accessed through the key. Keys can be plain text or hashes. Short keys are better for performance reasons. What does the key look like? Here are some examples:

  • Plain text key: "last_logged_in_at"
  • Hash key: 253DDEC4

The value in a key-value pair can be a string, a list, an object, etc. In key-value stores, the value is usually treated as an opaque object, as in Amazon Dynamo [1], Memcached [2], and Redis [3].

The following is a piece of data in a key-value store:

[Figure: sample key-value pairs]

In this chapter, you are asked to design a key-value store that supports the following operations:

  • put(key, value) // Insert the "value" associated with the "key"
  • get(key) // get the "value" associated with the "key"

Understand the problem and establish design scope

There is no perfect design. Each design strikes a specific balance among reads, writes, and memory usage. Another trade-off to make is between consistency and availability. In this chapter, we design a key-value store with the following characteristics:

  • The size of key-value pairs is small: less than 10 KB.
  • Capable of storing big data.
  • High availability: The system responds quickly, even during failures.
  • High scalability: The system can scale to support large data sets.
  • Auto Scaling: Automatically add/remove servers based on traffic.
  • Adjustable consistency.
  • Low latency.

Single Server Key-Value Store

It is easy to develop a key-value store that resides on a single server. The intuitive approach is to store key-value pairs in a hash table, keeping everything in memory. Although memory access is fast, it may not be possible to fit everything into memory due to space constraints. Two optimizations can be made to store more data in a single server:

  • Data compression
  • Store only frequently used data in memory, the rest on disk

Even with these optimizations, a single server can quickly reach its capacity. A distributed key-value store is needed to support big data.
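To make this concrete, here is a minimal sketch of a single-server store that applies both optimizations: values are compressed, and only the most recently used entries stay in memory while cold entries spill to a stand-in for disk. The class and parameter names are illustrative assumptions, not a reference implementation.

```python
import zlib
from collections import OrderedDict

class SingleServerKVStore:
    """Minimal sketch: a hash table in memory, with compressed values and
    least-recently-used entries spilled to a 'disk' store (here just a dict)."""

    def __init__(self, max_items_in_memory=1000):
        self.memory = OrderedDict()          # hot data, kept in LRU order
        self.disk = {}                       # stand-in for an on-disk store
        self.max_items_in_memory = max_items_in_memory

    def put(self, key, value):
        self.memory[key] = zlib.compress(value.encode())   # optimization 1: compression
        self.memory.move_to_end(key)
        if len(self.memory) > self.max_items_in_memory:    # optimization 2: keep only hot data in memory
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)                   # mark as recently used
            return zlib.decompress(self.memory[key]).decode()
        if key in self.disk:
            return zlib.decompress(self.disk[key]).decode()
        return None

store = SingleServerKVStore()
store.put("last_logged_in_at", "2023-05-25T20:02:27Z")
print(store.get("last_logged_in_at"))
```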

Distributed key-value store

A distributed key-value store is also known as a distributed hash table, which distributes key-value pairs across many servers. When designing distributed systems, it is important to understand the CAP theorem (Consistency, Availability, Partition tolerance).

CAP theorem

The CAP theorem states that it is impossible for a distributed system to simultaneously provide more than two of the following three guarantees: consistency, availability, and partition tolerance. Let's clear up some definitions.

Consistency : Consistency means that all clients see the same data at the same time, no matter which node they are connected to.

Availability : Availability means that any client requesting data will get a response, even if some nodes are down.

Partition Tolerance : A partition indicates a loss of communication between two nodes. Partition tolerance means that the system continues to function in the event of a network partition.

The CAP theorem states that one of the three properties must be sacrificed in favor of the other two, as shown in Figure 6-1.

[Figure 6-1: CAP theorem]

Today, key-value stores are classified according to the two CAP characteristics they support:

CP (Consistency and Partition Tolerance) systems : CP key-value stores support consistency and partition tolerance at the expense of availability.

AP (Availability and Partition Tolerance) Systems : AP key-value stores support availability and partition tolerance at the expense of consistency.

CA (Consistency and Availability) Systems : CA key-value stores support consistency and availability at the expense of partition tolerance. Since network failures are inevitable, distributed systems must tolerate network partitions. Therefore, there is no CA system in practical applications.

What you read above is mostly definitions. To make this easier to understand, let's look at some concrete examples. In a distributed system, data is usually replicated multiple times. Suppose data is replicated on three replica nodes, n1, n2, and n3, as shown in Figure 6-2.

Ideal situation

In an ideal world, network partitions would never occur. Data written to n1 is automatically copied to n2 and n3 . Both consistency and availability are achieved.

[Figure 6-2: ideal situation — data written to n1 is replicated to n2 and n3]

Real World Distributed Systems

In a distributed system, partitions are unavoidable, and when partitions occur, we must choose between consistency and availability. In Figure 6-3, n3 is down and unable to communicate with n1 and n2 . If a client writes data to n1 or n2 , the data will not propagate to n3. If data is written to n3 but has not been propagated to n1 and n2 , then n1 and n2 will have stale data.

[Figure 6-3: network partition — n3 is down and cannot communicate with n1 and n2]

If we choose consistency over availability (CP system), we have to block all write operations to n1 and n2 to avoid data inconsistency among these three servers, which would make the system unavailable. Banking systems typically have extremely high consistency requirements. For example, it is critical for a banking system to display up-to-date balance information. If an inconsistency occurs due to a network partition, the banking system returns an error until the inconsistency is resolved.

However, if we choose availability over consistency (AP system), the system keeps accepting reads, even though it may return stale data. For writes, n1 and n2 continue to accept them, and the data is synced to n3 once the network partition is resolved.

Choosing the correct CAP guarantee for your usage scenario is an important step in building a distributed key-value store. You can discuss this with the interviewer and then design the system based on the discussion.

System Components

In this section, we discuss the following core components and technologies used when building a key-value store:

  • Data partition
  • Data replication
  • Consistency
  • Inconsistency resolution
  • Failure handling
  • System architecture diagram
  • Write path
  • Read path

The following is largely based on three popular key-value storage systems: Dynamo [4], Cassandra [5], and Bigtable [6].

Data partition

For large applications, it is not feasible to fit the complete data set on a single server. The simplest approach is to split the data into smaller partitions and store them on multiple servers. There are two challenges when partitioning the data:

  • Distribute data evenly among multiple servers.
  • Minimize data movement when adding or removing nodes.

Consistent hashing, discussed in Chapter 5, is a great solution to these problems. Let's review again how consistent hashing works at a high level.

  • First, put the server on a hash ring. In Figure 6-4, eight servers, denoted s0, s1, ..., s7 , are placed on the hash ring.
  • Next, a key is hashed onto the same ring, and it is stored on the first server encountered as you move clockwise. For example, key0 is stored in s1 according to this logic .

[Figure 6-4: eight servers (s0 to s7) placed on a hash ring]

Using consistent hashing to partition data has the following advantages:

Autoscaling: Servers can be added and removed automatically based on load.

Heterogeneity: The number of virtual nodes of a server is proportional to the server capacity. For example, servers with larger capacity are assigned more virtual nodes.
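Below is a minimal sketch of such a hash ring with virtual nodes. The hash function (MD5), the number of virtual nodes, and the class names are assumptions chosen for illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map any string to a position on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, virtual_nodes_per_server=100):
        self.virtual_nodes_per_server = virtual_nodes_per_server
        self.ring = []        # sorted list of (position, server) pairs
        self.positions = []   # positions only, kept in sync for bisect lookups

    def add_server(self, server: str):
        for i in range(self.virtual_nodes_per_server):
            bisect.insort(self.ring, (ring_hash(f"{server}#vn{i}"), server))
        self.positions = [pos for pos, _ in self.ring]

    def remove_server(self, server: str):
        self.ring = [(pos, s) for pos, s in self.ring if s != server]
        self.positions = [pos for pos, _ in self.ring]

    def get_server(self, key: str) -> str:
        """Walk clockwise: the first virtual node at or after the key's position."""
        idx = bisect.bisect_left(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for s in ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]:
    ring.add_server(s)
print(ring.get_server("key0"))   # the server that owns key0
```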

Data replication

For high availability and reliability, data must be replicated asynchronously across N servers, where N is a configurable parameter. These N servers are chosen with the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and select the first N servers on the ring to store the data copies. In Figure 6-5 (N = 3), key0 is replicated at s1, s2, and s3.

[Figure 6-5: key0 replicated on s1, s2, and s3 (N = 3)]

In the case of virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this problem, we only select unique servers when performing the clockwise walk logic.
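Continuing the hash-ring sketch above (it reuses ConsistentHashRing and ring_hash), replica selection can be sketched as a clockwise walk that keeps only distinct physical servers until N are found:

```python
import bisect   # reuses ConsistentHashRing and ring_hash from the previous sketch

def get_replica_servers(ring, key: str, n: int = 3):
    """Walk clockwise from the key's position and collect the first N distinct
    physical servers, skipping virtual nodes of servers already chosen."""
    start = bisect.bisect_left(ring.positions, ring_hash(key))
    replicas = []
    for i in range(len(ring.ring)):
        _, server = ring.ring[(start + i) % len(ring.ring)]
        if server not in replicas:
            replicas.append(server)
        if len(replicas) == n:
            break
    return replicas

print(get_replica_servers(ring, "key0", n=3))   # e.g. ['s1', 's2', 's3']
```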

Due to power failures, network problems, natural disasters, etc., nodes in the same data center often fail at the same time. For better reliability, replicas are placed in different data centers, which are connected by high-speed network.

Consistency

Since data is replicated across multiple nodes, it must be synchronized across the replicas. Quorum consensus can guarantee consistency for both read and write operations. First, let's define a few terms.

N = number of replicas

W = write quorum of size W. For a write operation to be considered successful, acknowledgments must be received from at least W replicas.

R = read quorum of size R. For a read operation to be considered successful, responses must be received from at least R replicas.

Consider the example in Figure 6-6 with N = 3.

[Figure 6-6: quorum example with N = 3]

W = 1 does not mean that data is written to only one server. For the configuration in Figure 6-6, data is replicated on s0, s1, and s2. W = 1 means that the coordinator must receive at least one acknowledgment before considering the write operation successful. For example, if we get an acknowledgment from s1, we no longer need to wait for acknowledgments from s0 and s2. The coordinator acts as a proxy between the clients and the nodes.

The configuration of W, R and N is a typical trade-off between latency and consistency. If W = 1 or R = 1 , the operation returns quickly because the coordinator only needs to wait for a response from any one replica. If W or R > 1 , the system provides better consistency; however, queries will be slower because the coordinator has to wait for a response from the slowest replica.

If W + R > N , strong consistency is guaranteed because there must be at least one overlapping node with up-to-date data to ensure consistency.

How to configure N, W , and R to suit our use case? Here are some possible settings:

If R = 1 and W = N , the system is optimized for fast reads.

If W = 1 and R = N , the system is optimized for fast writes.

If W + R > N , strong consistency is guaranteed (usually N = 3, W = R = 2 ).

If W + R <= N , strong consistency cannot be guaranteed.

Depending on the requirements, we can tune the values of W, R, and N to achieve the desired level of consistency.
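As a toy illustration of the N/W/R trade-off, the sketch below simulates replicas as in-process objects and a coordinator that succeeds once W write acknowledgments or R read responses arrive. The class names and the simple version counter are assumptions made for illustration only.

```python
class Replica:
    """A toy replica: stores (value, version) per key and may be down."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value, version):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = (value, version)

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)               # (value, version) or None

class Coordinator:
    """Sends every operation to the N replicas and succeeds once W write
    acknowledgments (or R read responses) have been collected."""
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0                        # stand-in for a real timestamp

    def put(self, key, value):
        self.version += 1
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value, self.version)
                acks += 1
            except ConnectionError:
                pass
        if acks < self.w:
            raise RuntimeError("write failed: fewer than W acknowledgments")

    def get(self, key):
        responses = []
        for replica in self.replicas:
            try:
                responses.append(replica.read(key))
            except ConnectionError:
                continue
            if len(responses) >= self.r:
                break
        if len(responses) < self.r:
            raise RuntimeError("read failed: fewer than R responses")
        # Return the value with the highest version among the responses.
        newest = max((resp for resp in responses if resp), key=lambda vv: vv[1], default=None)
        return newest[0] if newest else None

replicas = [Replica("s0"), Replica("s1"), Replica("s2")]   # N = 3
coordinator = Coordinator(replicas, w=2, r=2)              # W + R > N: strong consistency
coordinator.put("name", "john")
replicas[2].up = False                                     # one replica fails
print(coordinator.get("name"))                             # still succeeds: 2 responses >= R
```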

Consistency model

The consistency model is another important factor to consider when designing a key-value store. A consistency model defines the degree of data consistency, and there are many possible consistency models:

  • Strong consistency: the value returned by any read operation corresponds to the result of the most recently updated written data item. Clients never see stale data.
  • Weak consistency: Subsequent read operations may not see the latest updated value.
  • Eventual consistency: This is a specific form of weak consistency. Given enough time, all updates are propagated and all replicas are consistent.

Strong consistency is usually achieved by forcing the replicas not to accept new reads or writes until every replica has agreed on the current write. This approach is not ideal for highly available systems because it can block new operations. Dynamo and Cassandra adopt eventual consistency, which is our recommended consistency model for the key-value store. With eventual consistency, concurrent writes allow inconsistent values to enter the system, and clients must read the values and reconcile them. The next section explains how versioning makes this reconciliation work.

Inconsistency Resolution: Version Control

Replication provides high availability but leads to inconsistencies between replicas. Versioning and vector clocks are used to resolve inconsistencies. Versioning means treating each data modification as a new, immutable version of the data. Before we talk about versioning, let's use an example to explain how inconsistencies happen:

As shown in Figure 6-7, replica nodes n1 and n2 have the same value. Let's call this value the original value. Server 1 and Server 2 fetch the same value for *get("name")* operations.

[Figure 6-7: both servers read the same original value]

Next, Server 1 changes the name to "johnSanFrancisco" and Server 2 changes the name to "johnNewYork", as shown in Figure 6-8. These two changes happen simultaneously, so we now have conflicting values, called versions v1 and v2.

[Figure 6-8: concurrent updates produce conflicting versions v1 and v2]

In this example, the original value can be ignored, since the modification is based on it. However, there is no clear way to resolve conflicts between the last two versions. To solve this problem, we need a versioning system that can detect conflicts and reconcile them. Vector clocks are a common technique to solve this problem. Let's see how the vector clock works.

A vector clock is a *[server, version]* pair associated with a data item. It can be used to check if a version is before, after, or in conflict with other versions.

Suppose a vector clock is represented by D([S1, v1], [S2, v2], ..., [Sn, vn]), where D is a data item, S1 is a server number, v1 is a version counter, and so on. If data item D is written to server Si, the system must perform one of the following tasks:

  • If *[Si, vi]* exists, increment *vi*.
  • Otherwise, create a new entry *[Si, 1]*.

The abstract logic above is explained with concrete examples in Figure 6-9.

[Figure 6-9: vector clock example]

  1. The client writes data item D1 to the system, and the write operation is handled by server Sx , which now has vector clock D1[(Sx, 1)] .

  2. Another client reads the latest D1 , updates it to D2 , and writes it back. D2 is derived from D1 , so it overrides D1 . Suppose the write operation is handled by the same server Sx , now Sx has vector clock D2([Sx, 2]) .

  3. Another client reads the latest D2, updates it to D3, and writes it back. Suppose the write operation is handled by server Sy, which now has vector clock D3([Sx, 2], [Sy, 1]).

  4. Another client reads the latest D2, updates it to D4, and writes it back. Suppose the write operation is handled by server Sz, which now has D4([Sx, 2], [Sz, 1]).

  5. When another client reads D3 and D4 , it finds a conflict caused by data item D2 being modified by both Sy and Sz . Conflicts are resolved by the client and updated data is sent to the server. Suppose the write operation is handled by Sx , and now Sx has D5([Sx, 3], [Sy, 1], [Sz, 1]) . We'll explain how to detect conflicts later.

Using vector clocks, it is easy to tell that version X is an ancestor of version Y (i.e. there is no conflict) if the version counters for each participant in version Y 's vector clock are greater than or equal to the counters in version X. For example, the vector clock *D([s0, 1], [s1, 1]) is an ancestor of D([s0, 1], [s1, 2])*. Therefore, no conflicts are recorded.

Likewise, version X can be judged to be a sibling of Y if there is any participant whose counter in Y's vector clock is less than the corresponding counter in X (that is, there is a conflict). For example, the following two vector clocks indicate a conflict: D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]).
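A minimal sketch of these vector clock rules, with clocks represented as dictionaries that map server IDs to version counters (the helper names are assumptions):

```python
def increment(clock: dict, server: str) -> dict:
    """Apply the write rule: bump the server's counter, or create [server, 1]."""
    new_clock = dict(clock)
    new_clock[server] = new_clock.get(server, 0) + 1
    return new_clock

def is_ancestor(x: dict, y: dict) -> bool:
    """X is an ancestor of Y if every counter in Y is >= the corresponding one in X."""
    return all(y.get(server, 0) >= version for server, version in x.items())

def in_conflict(x: dict, y: dict) -> bool:
    """Siblings: neither clock is an ancestor of the other."""
    return not is_ancestor(x, y) and not is_ancestor(y, x)

d1 = increment({}, "Sx")                 # D1: {Sx: 1}
d2 = increment(d1, "Sx")                 # D2: {Sx: 2}
d3 = increment(d2, "Sy")                 # D3: {Sx: 2, Sy: 1}
d4 = increment(d2, "Sz")                 # D4: {Sx: 2, Sz: 1}

print(is_ancestor(d2, d3))               # True: D3 descends from D2, no conflict
print(in_conflict(d3, d4))               # True: concurrent updates to D2
```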

Although vector clocks can resolve conflicts, there are two significant disadvantages. First, the vector clock increases the complexity of the client, since it needs to implement conflict resolution logic.

Second, the *[server: version]* pairs in a vector clock can grow rapidly. To solve this, we set a threshold on the length; if the limit is exceeded, the oldest pairs are removed. This can lead to inefficiencies during reconciliation because descendant relationships can no longer be determined accurately. However, according to the Dynamo paper [4], Amazon has not encountered this problem in production, so it is probably an acceptable solution for most companies.

Handling failures

As with any large-scale system, failures are not only inevitable but common. Handling failure scenarios is very important. In this section, we first introduce techniques for detecting failures. Then we discuss common failure handling strategies.

Fault detection

In a distributed system, it is not enough to believe that a server is down just because another server says it is down. Typically, at least two independent sources of information are required to flag a server as down.

As shown in Figure 6-10, all-to-all multicasting is a simple and straightforward solution. However, it is inefficient when there are many servers in the system.

[Figure 6-10: all-to-all multicasting]

A better solution is to use a decentralized failure detection method such as the gossip protocol, which works as follows (a simplified sketch in code appears after the example below):

  • Each node maintains a node membership list, which contains member IDs and heartbeat counters.
  • Each node periodically increments its heartbeat counter.
  • Each node periodically sends heartbeats to a random set of nodes, which in turn propagate to another set of nodes.
  • Once the node receives the heartbeat, the membership list will be updated with the latest information.
  • If the heartbeat does not increase within a predetermined period of time, the member is considered offline.

[Figure 6-11: gossip protocol example]

As shown in Figure 6-11:

  • Node s0 maintains the node membership list shown on the left.
  • Node s0 notices that the heartbeat counter of node s2 (member ID = 2) has not been incremented for a long time.
  • Node s0 sends a heartbeat containing information about s2 to a set of random nodes. Once other nodes confirm that s2 's heartbeat counter has not been updated for a long time, node s2 is marked as down, and this information is propagated to other nodes.
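Here is the simplified, single-process sketch mentioned above. Real gossip exchanges membership lists over the network between many peers; the timeout, fanout, and class names here are assumptions for illustration.

```python
import random
import time

class GossipNode:
    def __init__(self, node_id, all_node_ids, offline_after_seconds=10):
        self.node_id = node_id
        self.offline_after_seconds = offline_after_seconds
        # Membership list: node id -> [heartbeat counter, last time the counter increased]
        self.membership = {n: [0, time.time()] for n in all_node_ids}

    def tick(self):
        """Periodically increment our own heartbeat counter."""
        self.membership[self.node_id][0] += 1
        self.membership[self.node_id][1] = time.time()

    def gossip_to(self, peers, fanout=2):
        """Send our membership list to a few random peers."""
        for peer in random.sample(peers, min(fanout, len(peers))):
            peer.merge(self.membership)

    def merge(self, remote_membership):
        """On receiving a heartbeat, keep the larger counter per node."""
        for node, (counter, _) in remote_membership.items():
            if counter > self.membership[node][0]:
                self.membership[node] = [counter, time.time()]

    def offline_nodes(self):
        """Nodes whose heartbeat has not increased within the timeout."""
        now = time.time()
        return [n for n, (_, seen) in self.membership.items()
                if now - seen > self.offline_after_seconds and n != self.node_id]

node_ids = ["s0", "s1", "s2"]
nodes = {nid: GossipNode(nid, node_ids, offline_after_seconds=0.3) for nid in node_ids}
time.sleep(0.4)                                  # s2 stays silent the whole time
nodes["s0"].tick(); nodes["s1"].tick()           # healthy nodes keep incrementing
nodes["s1"].gossip_to([nodes["s0"]])             # s0 learns that s1 is still alive
print(nodes["s0"].offline_nodes())               # ['s2']
```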

Handling temporary failures

After a failure is detected through the gossip protocol, the system needs to deploy certain mechanisms to ensure availability. In a strict quorum approach, as shown in the quorum consensus section, read and write operations may block.

A technique called "sloppy quorum" [4] is used to improve availability. Instead of enforcing the strict quorum requirement, the system chooses the first W healthy servers on the hash ring for write operations and the first R healthy servers for read operations. Offline servers are ignored.

If a server is unavailable due to a network or server failure, another server temporarily handles its requests. When the downed server comes back up, the changes are pushed back to it to restore data consistency. This process is called hinted handoff. Since s2 in Figure 6-12 is unavailable, s3 temporarily handles its read and write operations. When s2 comes back online, s3 hands the data back to s2.

[Figure 6-12: hinted handoff — s3 temporarily handles s2's requests]
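A small sketch of sloppy quorum plus hinted handoff: writes go to the first W healthy servers on the ring, a stand-in records a hint for each downed server it covers, and the hint is replayed when that server comes back. The data structures and names are illustrative assumptions; a real system would persist hints durably.

```python
class HintedHandoffNode:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}            # key -> value
        self.hints = {}           # intended owner name -> list of (key, value) to replay later

    def write(self, key, value, intended_owner=None):
        if intended_owner and intended_owner != self.name:
            # We are only holding this write on behalf of a downed server.
            self.hints.setdefault(intended_owner, []).append((key, value))
        self.data[key] = value

    def hand_back(self, nodes_by_name):
        """Hinted handoff: when a downed server comes back, replay its writes."""
        for owner_name, writes in list(self.hints.items()):
            owner = nodes_by_name[owner_name]
            if owner.up:
                for key, value in writes:
                    owner.write(key, value)
                del self.hints[owner_name]

def sloppy_quorum_write(ring_order, key, value, w):
    """Write to the first W healthy servers on the ring; each downed quorum
    member is covered by the next healthy server, which records a hint."""
    intended = ring_order[:w]                                  # strict quorum members
    stand_ins = iter(n for n in ring_order[w:] if n.up)        # healthy servers after them
    for owner in intended:
        if owner.up:
            owner.write(key, value)
        else:
            stand_in = next(stand_ins)                         # next healthy server takes over
            stand_in.write(key, value, intended_owner=owner.name)

s1, s2, s3, s4 = (HintedHandoffNode(n) for n in ["s1", "s2", "s3", "s4"])
s2.up = False                                                  # s2 fails
sloppy_quorum_write([s1, s2, s3, s4], "key0", "value0", w=2)
print(s3.hints)                                                # {'s2': [('key0', 'value0')]}
s2.up = True                                                   # s2 comes back online
s3.hand_back({"s1": s1, "s2": s2, "s3": s3, "s4": s4})
print(s2.data)                                                 # {'key0': 'value0'}
```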

Dealing with permanent failures

Hinted Handoff is used to handle temporary failures. So what if the replica is permanently unavailable? To handle such cases, we implement an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing every piece of data on replicas and updating each replica to the latest version. We use a structure called a Merkle tree to detect inconsistencies and minimize the amount of data transferred.

Quoting from Wikipedia [7]: "A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the hash of the labels or values (in the case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures."

Assuming the keyspace is from 1 to 12, the following steps show how to build a Merkle tree. Highlighted boxes indicate inconsistencies.

Step 1: Divide the keyspace into buckets (4 in our example) as shown in Figure 6-13. A bucket is used as the root level node to maintain the depth limit of the tree.

[Figure 6-13: the keyspace divided into four buckets]

Step 2: Once the bucket is created, hash each key in the bucket using a uniform hashing method (Figure 6-14).

[Figure 6-14: keys hashed within each bucket]

Step 3: Create a hash node for each bucket (Figure 6-15).

[Figure 6-15: one hash node per bucket]

Step 4: Build the tree up to the root node by computing the hashes of the child nodes (Figure 6-16).

[Figure 6-16: the Merkle tree built up to the root]

To compare two Merkle trees, first compare the root hashes. If the root hashes match, then both servers have the same data. If the root hashes are different, then the left child hashes are compared, then the right child hashes. You can walk the tree to find out which buckets are not synced, and sync only those buckets.

With Merkle trees, the amount of data that needs to be synchronized is proportional to the difference between the two replicas, not the amount of data they contain. In a real world system, the bucket size is quite large. For example, a possible configuration is to have a million buckets per billion keys, so each bucket contains only 1000 keys.
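A small sketch of bucket-level Merkle comparison following the steps above. The bucket count, the keyspace of 1 to 12, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib

def bucket_hashes(data: dict, num_buckets: int = 4, keyspace: int = 12):
    """Steps 1-3: split the keyspace into buckets and hash each bucket's contents."""
    keys_per_bucket = keyspace // num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(data):
        buckets[(key - 1) // keys_per_bucket].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def merkle_root(hashes):
    """Step 4: build the tree bottom-up until a single root hash remains
    (assumes the number of buckets is a power of two)."""
    level = list(hashes)
    while len(level) > 1:
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

def out_of_sync_buckets(data_a, data_b):
    hashes_a, hashes_b = bucket_hashes(data_a), bucket_hashes(data_b)
    if merkle_root(hashes_a) == merkle_root(hashes_b):
        return []                       # identical root hashes: nothing to sync
    return [i for i, (ha, hb) in enumerate(zip(hashes_a, hashes_b)) if ha != hb]

replica_a = {k: f"v{k}" for k in range(1, 13)}
replica_b = {**replica_a, 7: "stale"}               # the bucket holding key 7 differs
print(out_of_sync_buckets(replica_a, replica_b))    # only bucket 2 needs to be synced
```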

Dealing with Data Center Failures

Data center failures may occur due to power outages, network outages, natural disasters, etc. In order to build a system that can handle data center failures, it is important to replicate data to multiple data centers. Even if one data center is completely offline, users can still access data through other data centers.

System Architecture Diagram

Now that we have discussed the different technical considerations when designing a key-value store, we can turn our attention to the architectural diagram, shown in Figure 6-17.

[Figure 6-17: system architecture diagram]

The main features of the architecture are as follows:

  • Clients communicate with the key-value store through a simple API: get(key) and put(key, value).
  • A coordinator is a node that acts as a proxy between clients and the key-value store.
  • Nodes are distributed on the ring using consistent hashing.
  • The system is fully decentralized, adding and moving nodes can be done automatically.
  • Data is replicated on multiple nodes.
  • Since each node has the same set of responsibilities, there is no single point of failure.

Since the design is decentralized, each node performs many tasks, as shown in Figure 6-18.

[Figure 6-18: tasks performed by each node]

Write path

Figure 6-19 explains what happens when a write request is directed to a specific node. Note that the proposed design of the write/read path is mainly based on the architecture of Cassandra [8].

[Figure 6-19: write path]

  1. Write requests are persisted in the commit log file.

  2. Data is kept in memory cache.

  3. When the memory cache is full or reaches a predefined threshold, the data is flushed to an SSTable [9] on disk. Note: a sorted-string table (SSTable) is a sorted list of <key, value> pairs. Readers who want to learn more about SSTables can refer to [9]. A minimal sketch of this write path follows.
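The sketch below walks through the three steps: append to a commit log, buffer in a memory cache (memtable), and flush to a sorted SSTable file once a threshold is reached. The file formats, names, and threshold are assumptions; real SSTable formats are more involved (see [9]).

```python
import json
import os
import tempfile

class WritePathNode:
    """Sketch of the write path: append to a commit log, buffer in a memtable,
    flush to a sorted on-disk SSTable when the memtable is full."""

    def __init__(self, data_dir, memtable_limit=3):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                                  # in-memory cache
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.sstable_count = 0

    def put(self, key, value):
        # 1. Persist the write in the commit log (for crash recovery).
        self.commit_log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.commit_log.flush()
        # 2. Keep the data in the in-memory cache.
        self.memtable[key] = value
        # 3. Flush to an SSTable once the threshold is reached.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count}.json")
        with open(path, "w") as f:
            # An SSTable is a sorted list of <key, value> pairs.
            json.dump(sorted(self.memtable.items()), f)
        self.sstable_count += 1
        self.memtable = {}

node = WritePathNode(tempfile.mkdtemp())
for i in range(3):
    node.put(f"key{i}", f"value{i}")
print(os.listdir(node.data_dir))   # the commit log plus the first flushed SSTable
```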

Read path

After a read request is directed to a specific node, it is first checked whether the data is in the memory cache. If yes, the data will be returned to the client, as shown in Figure 6-20.

[Figure 6-20: read path when the data is in memory]

If the data is not in memory, it will be retrieved from disk. We need an efficient way to find out which SSTable contains this key. Bloom filters [10] are usually used to solve this problem.

Figure 6-21 shows the read path when the data is not in memory; a sketch in code follows the steps below.

[Figure 6-21: read path when the data is not in memory]

  1. The system first checks to see if the data is in memory. If not, go to step 2.
  2. If the data is not in memory, the system checks the bloom filter.
  3. Bloom filters are used to determine which SSTables are likely to contain the key.
  4. The SSTables return the result data set.
  5. The result data set is returned to the client.
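The sketch below mirrors these steps with a tiny Bloom filter and in-process "SSTables". The Bloom filter parameters and all names are assumptions; a production filter would be sized from the expected key count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may answer 'maybe present' for absent keys,
    but never misses a key that was added."""
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes, self.bits = size, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

class ReadPathNode:
    def __init__(self):
        self.memtable = {}
        self.sstables = []          # list of (bloom filter, sorted table) pairs, newest first

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.insert(0, (bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def get(self, key):
        # 1. Check the memory cache first.
        if key in self.memtable:
            return self.memtable[key]
        # 2-4. Use each SSTable's Bloom filter to skip tables that cannot contain the key.
        for bloom, table in self.sstables:
            if bloom.might_contain(key) and key in table:
                return table[key]   # 5. return the result to the client
        return None

node = ReadPathNode()
node.memtable["name"] = "john"
node.flush()                        # "name" now lives only in an SSTable on "disk"
print(node.get("name"))             # found via the Bloom filter + SSTable lookup
print(node.get("missing"))          # None
```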

Summary

This chapter introduces many concepts and techniques. To help you recall them, here is a summary of the features of a distributed key-value store and the techniques used to achieve them:

  • Ability to store big data: consistent hashing to spread the load across servers
  • High availability for reads: data replication and multi-data center deployment
  • High availability for writes: versioning and conflict resolution with vector clocks
  • Data partitioning: consistent hashing
  • Auto scaling and heterogeneity: consistent hashing with virtual nodes
  • Tunable consistency: quorum consensus
  • Handling temporary failures: sloppy quorum and hinted handoff
  • Handling permanent failures: Merkle tree based anti-entropy
  • Handling data center outages: cross-data center replication

References

[1] Amazon DynamoDB: https://aws.amazon.com/dynamodb/

[2] memcached: https://memcached.org/

[3] Redis: https://redis.io/

[4] Dynamo: Amazon's highly available key-value store: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

[5] Cassandra: https://cassandra.apache.org/

[6] Bigtable: A Distributed Storage System for Structured Data: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

[7] Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree

[8] Cassandra architecture: https://cassandra.apache.org/doc/latest/architecture/

[9] SSTable: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/

[10] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
