KEY/VALUE-tair

What is Tair

Tair is a distributed key/value storage engine developed by Taobao. It can be used in two modes: persistent and non-persistent.

A non-persistent Tair can be seen as a distributed cache.

A persistent Tair stores data on disk. The number of backups kept for the data is configurable, and Tair automatically places the different backups of a piece of data on different hosts. When a host fails and cannot serve requests normally, its backups continue to provide service.

Tair's overall architecture

As a distributed system, Tair consists of a central control node and a series of service nodes. The central control node is called the config server, and the service nodes are called data servers.

The config server manages all the data servers and maintains their status information. It is a control point and a single point, so it is currently deployed as one master and one backup to ensure its reliability.

The data servers provide the various data services to the outside world and report their status to the config server through heartbeats. All data servers have equal standing.

Basic concepts of Tair

config server

1) Learn which nodes in the cluster are alive by maintaining heartbeats with the dataservers.
2) Build the data distribution table in the cluster according to the information of the surviving nodes.
3) Provide query services for data distribution tables.
4) Schedule data migration and replication between dataservers.

Judging from Tair's overall architecture diagram, configserver looks very much like the central node in a traditional distributed cluster, with the services of the entire cluster depending on it to work properly.

However, Tair's configserver is a lightweight central node. In most cases, the unavailability of configserver does not affect the services of the cluster.

Tair clients interact with configserver mainly to obtain the comparison table that describes data distribution. Once a client has obtained the table, it caches it and determines which node stores a piece of data by looking it up locally, so ordinary requests do not need to involve configserver. This makes Tair's external services independent of configserver, which is why it is not a central node in the traditional sense.

The comparison table maintained by configserver has a version number, which increases every time a new table is generated. When the status of a data node changes (for example, a node is added or becomes unavailable), configserver regenerates the comparison table based on the currently available nodes and synchronizes the new table to the data nodes through their heartbeats.

Whenever a client sends a request to a data node, the data node puts the version number of its own comparison table into the response. After receiving the response, the client compares the version number returned by the data node with its own. If they differ, the client contacts configserver to request the new comparison table.
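The version check described above can be sketched as follows. This is a minimal illustration, not Tair's client code: the ConfigServer and Client classes, their method names, and the node names are all hypothetical.

```python
class ConfigServer:
    """Hypothetical stand-in for configserver: hands out (version, table)."""

    def __init__(self, table):
        self.version = 1
        self.table = list(table)
        self.fetch_count = 0  # how many times a client asked for the table

    def update_table(self, table):
        # A new table is generated, so the version number increases.
        self.table = list(table)
        self.version += 1

    def get_table(self):
        self.fetch_count += 1
        return self.version, list(self.table)


class Client:
    """Hypothetical client that caches the comparison table."""

    def __init__(self, config_server):
        self.config_server = config_server
        # Initialization is the one step that always needs configserver.
        self.version, self.table = config_server.get_table()

    def on_response(self, server_version):
        """Compare the version in a data node response with the cached one."""
        if server_version != self.version:
            self.version, self.table = self.config_server.get_table()
            return True   # table was refreshed
        return False      # cached table is still current


cs = ConfigServer(["ds-a", "ds-b"])
client = Client(cs)
refreshed_same = client.on_response(1)            # same version, no refresh
cs.update_table(["ds-a", "ds-c"])                 # a node changed
refreshed_new = client.on_response(cs.version)    # data node reports version 2
```

In this sketch the client contacts the config server only twice: once at initialization and once after it sees a newer version in a response.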

Therefore, the client does not need to maintain a heartbeat with configserver in order to keep its comparison table up to date. Under normal circumstances the client does not communicate with configserver at all, so even if configserver is unavailable, the services of the cluster are not greatly affected.

The one exception is a client that needs to initialize while configserver is unavailable: it cannot obtain the comparison table, so it cannot work normally.

data server

1) Provide the storage engine
2) Accept client operations such as put/get/remove
3) Perform data migration, replication, etc.
4) Plug-ins: run custom processing when a request is accepted
5) Access statistics

invalid server

1) After receiving requests such as invalid/hide from a client, perform the delete/hide operation on all clusters belonging to the same group (independent clusters deployed in two data centers) to keep the clusters in the group consistent.
2) After a cluster has been cut off from the network, clean up the dirty data.
3) Access statistics.

client

1) Provide the interface through which applications access the Tair cluster.
2) Fetch, update, and cache the data distribution table, the invalidserver address, etc.
3) LocalCache, to keep accesses to hot data from affecting the Tair cluster's services.
4) Flow control

table (comparison table)

The comparison table mainly exists to solve load balancing. The number of rows in the table is a fixed value, and this value should be much larger than the number of physical machines in a cluster. Because the table has to be synchronized to every client that uses Tair, it cannot be too large, otherwise synchronization becomes expensive. We typically use 1023 rows in production.

configId

A configID uniquely identifies a Tair cluster; each cluster has its own configID. In most current applications, the configID is stored in Diamond (apparently a configuration management platform), where it maps to the cluster's configserver address and groupname. An application needs to configure this configID when initializing the Tair client.

namespace

Also known as an area, a namespace is a memory or persistent storage area that Tair allocates to an application. An application's data can be thought of as living in its own namespace. A namespace is unique within a cluster (that is, within one configID).
Namespaces let different applications store data under the same key in the same cluster: the keys are identical, but the contents do not conflict. Within a single namespace, storing the same key again does affect the existing content. With the simple K/V form the old value is overwritten. With storage engines that have data structures, such as rdb, how the content changes depends on the interface used.
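One way to picture this is a store keyed on the pair (namespace, key). The sketch below is only an illustration of that idea. The Cluster class and the sample keys are hypothetical, and it models just the simple K/V overwrite behaviour.

```python
class Cluster:
    """Hypothetical single-cluster store keyed on (namespace, key)."""

    def __init__(self):
        self._data = {}

    def put(self, namespace, key, value):
        # Same namespace and same key: the old value is overwritten.
        self._data[(namespace, key)] = value

    def get(self, namespace, key):
        return self._data.get((namespace, key))


cluster = Cluster()
cluster.put(1, "user:1", "profile")     # application 1
cluster.put(2, "user:1", "cart")        # application 2, same key
cluster.put(1, "user:1", "profile-v2")  # overwrite inside namespace 1
```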

Load balancing in Tair

Tair distributes data using a consistent hashing algorithm. All keys are divided into Q buckets, and the bucket is the basic unit of load balancing and data migration.

The config server assigns each bucket to a data server according to a certain strategy. Because data is hashed by key, the amount of data in each bucket can be considered roughly equal. As long as the buckets are distributed evenly across the data servers, the data is distributed evenly as well.
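The key-to-bucket-to-server lookup can be sketched as below. The hash function (CRC32 from Python's zlib) and the round-robin assignment are placeholders chosen for the illustration, not the hash or strategy Tair actually uses. The sketch also assumes that Q equals the 1023 table rows mentioned earlier, with one row per bucket.

```python
import zlib
from collections import Counter

BUCKET_COUNT = 1023  # Q, assumed equal to the number of table rows


def bucket_of(key: bytes, bucket_count: int = BUCKET_COUNT) -> int:
    """Hash the key and map it to one of the Q buckets (placeholder hash)."""
    return zlib.crc32(key) % bucket_count


def build_table(servers, bucket_count: int = BUCKET_COUNT):
    """Assign every bucket to a data server (round-robin placeholder)."""
    return [servers[i % len(servers)] for i in range(bucket_count)]


def server_for(key: bytes, table) -> str:
    """Look up the data server responsible for a key."""
    return table[bucket_of(key, len(table))]


servers = ["ds-a", "ds-b", "ds-c"]   # hypothetical data server names
table = build_table(servers)
owner = server_for(b"user:42", table)
buckets_per_server = Counter(table)
```

Because the unit of assignment is the bucket rather than the key, moving load between servers only means changing rows of the table.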

Tair's data consistency

In a distributed system, reliability and consistency cannot both be guaranteed at the same time, because network errors have to be tolerated. Tair uses replication to improve reliability, with some optimizations for efficiency. When no errors occur, Tair in fact provides strong consistency. When a data server fails, however, clients may be unable to read the latest data for a certain window of time, and the latest data may even be lost.

Data server data migration process

When a migration takes place, suppose data server A is to migrate buckets 1, 2, and 3 to data server B. Because the client's routing table does not change until the migration is complete, requests for buckets 1, 2, and 3 are still routed to A. Now suppose that bucket 1 has not been migrated yet, bucket 2 is being migrated, and bucket 3 has already been migrated. Then:

  • An access to bucket 1 needs no special handling and is served as before;
  • An access to bucket 3 is forwarded by A to B, and A returns B's result to the client;
  • An access to bucket 2 is processed on A, and if it is a modification, a modification log is recorded. When the migration of bucket 2 finishes, the log is sent to B and applied there. Only when the data for bucket 2 is fully consistent on A and B is the migration truly complete.
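The three cases above can be sketched from the point of view of server A. This is a simplified model under stated assumptions: the DataServer class, the State enum, and the explicit finish_migration call are hypothetical, and the bulk copy of a bucket's existing data is left out.

```python
from enum import Enum


class State(Enum):
    NOT_MIGRATED = 1
    MIGRATING = 2
    MIGRATED = 3


class DataServer:
    """Hypothetical data server; only the source side tracks migration state."""

    def __init__(self, name):
        self.name = name
        self.data = {}     # bucket -> {key: value}
        self.state = {}    # bucket -> State
        self.log = {}      # bucket -> [(key, value)] recorded while migrating
        self.target = None

    def put(self, bucket, key, value):
        """Return the name of the server that actually handled the write."""
        state = self.state.get(bucket, State.NOT_MIGRATED)
        if state is State.MIGRATED:
            return self.target.put(bucket, key, value)   # forward to B
        self.data.setdefault(bucket, {})[key] = value    # process locally
        if state is State.MIGRATING:
            self.log.setdefault(bucket, []).append((key, value))
        return self.name

    def finish_migration(self, bucket):
        """Replay the modification log on the target, then mark as migrated."""
        for key, value in self.log.pop(bucket, []):
            self.target.data.setdefault(bucket, {})[key] = value
        self.state[bucket] = State.MIGRATED


a, b = DataServer("A"), DataServer("B")
a.target = b
a.state = {1: State.NOT_MIGRATED, 2: State.MIGRATING, 3: State.MIGRATED}

handled_1 = a.put(1, "k1", "v1")   # served by A as usual
handled_2 = a.put(2, "k2", "v2")   # served by A and logged
handled_3 = a.put(3, "k3", "v3")   # forwarded to B
logged_before_finish = list(a.log[2])
a.finish_migration(2)
```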

What happens when a data server goes down?

When a data server fails and becomes unavailable, the config server detects it. The config server recalculates a new table describing how buckets are distributed across the data servers, and reassigns the buckets originally served by the failed machine to other data servers.

If the migration is caused by a data server going down, the client receives an intermediate, temporary allocation table. In this table, the buckets that the failed data server was responsible for are temporarily assigned to the data servers holding their backups. At this point the service is available, but the load may be unbalanced. Once the migration completes, the cluster reaches a new load-balanced state.

Data migration may occur at this stage. For example, a bucket that data server A was originally responsible for may be assigned to B in the new table. If B does not hold the data for that bucket, the data is migrated to B. At the same time, the config server finds the buckets whose number of backups has dropped and, according to load, adds backups of those buckets on the less loaded data servers. When data servers are added to the system, the config server coordinates the existing data servers, according to load, to migrate some of the buckets they control to the new data servers. After the migration is complete, the routing is adjusted.
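The two tables involved in a failure can be sketched as below: a temporary table in which the surviving copies take over, and a rebalanced table in which missing backups are re-created on the least loaded servers. The function name, the table layout (one list of servers per bucket, primary first), and the "fewest buckets" load measure are all assumptions made for this illustration.

```python
from collections import Counter


def handle_failure(table, failed, alive):
    """Return (temporary_table, rebalanced_table) after `failed` goes down."""
    # Temporary state: drop the failed server so surviving copies take over.
    temporary = [[s for s in copies if s != failed] for copies in table]

    # Final state: restore the copy count on the least loaded servers.
    rebalanced = [list(copies) for copies in temporary]
    load = Counter(s for copies in rebalanced for s in copies)
    copy_count = max(len(copies) for copies in table)
    for copies in rebalanced:
        while len(copies) < copy_count:
            candidates = [s for s in alive if s not in copies]
            if not candidates:
                break  # not enough servers left to hold another copy
            target = min(candidates, key=lambda s: load[s])
            copies.append(target)
            load[target] += 1
    return temporary, rebalanced


# Four buckets, two copies each; server "A" fails and "D" is lightly loaded.
table = [["A", "B"], ["B", "C"], ["C", "A"], ["A", "C"]]
temporary, rebalanced = handle_failure(table, "A", ["B", "C", "D"])
```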

Version control of stored data in Tair

Every piece of data stored in Tair has a version number, which increases after each update. Correspondingly, the Tair put interface has a version parameter. This parameter exists to handle concurrent updates of the same data, similar to optimistic locking.
In many cases, updating data means getting it first, modifying the value returned by get, and then putting it back. If several clients get the same data, modify it, and save it, the modification saved first is overwritten by the one that arrives later, which causes a data consistency problem. In most cases the application can accept this, but in a few special cases it is not what we want.
For example, suppose the system holds the value "1" and clients A and B both fetch it at the same time. Both then want to change it: A to 12 and B to 13. With no control in place, whichever of A or B updates first has its update overwritten by the later one. The version mechanism introduced by Tair avoids this problem. In the same example, suppose the version number is 10 when A and B fetch the data. A updates first, and after the update succeeds the value is 12 and the version is 11. When B then updates, the version it is based on is still 10, so the server rejects the update and returns a version error, which prevents A's update from being overwritten. B can either get the new version of the value and modify that, or choose to force the update.

The rules for how the version changes are as follows:
1) If new data is put and no version number is set, the version is automatically set to 1.
2) If put updates existing data and either no version number is given or the version parameter equals the current version, the update succeeds and the version number is incremented by 1.
3) If put updates existing data and the version parameter differs from the current version, the update fails and VersionError is returned.
4) If the version parameter passed to put is 0, the update is forced and succeeds, and the version number is incremented by 1.
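The four rules can be sketched with a small in-memory store. This is not the Tair client API: the VersionedStore class and its put signature are hypothetical, None stands for "no version number set", and new data put with an explicit version is assumed to start at version 1 as well, since the rules above do not cover that case.

```python
class VersionError(Exception):
    """Raised when the version passed to put() does not match the stored one."""


class VersionedStore:
    def __init__(self):
        self._items = {}  # key -> (value, version)

    def get(self, key):
        return self._items[key]

    def put(self, key, value, version=None):
        if key not in self._items:
            self._items[key] = (value, 1)                 # rule 1
            return 1
        _, current = self._items[key]
        if version is None or version == 0 or version == current:
            self._items[key] = (value, current + 1)       # rules 2 and 4
            return current + 1
        raise VersionError(f"expected version {current}, got {version}")  # rule 3


# Replay the example from the text: the value "1" is at version 10.
store = VersionedStore()
for _ in range(10):
    store.put("k", "1")
_, seen_by_a = store.get("k")
_, seen_by_b = store.get("k")

version_after_a = store.put("k", "12", version=seen_by_a)   # A updates first
try:
    store.put("k", "13", version=seen_by_b)                  # B is now stale
    b_rejected = False
except VersionError:
    b_rejected = True
version_after_force = store.put("k", "13", version=0)        # B forces the update
```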

Tair's storage engines

mdb is Tair's earliest cache storage engine and also the most widely used centralized cache. It is especially suitable for scenarios with small capacity (generally at the M level, within 50G) and high read/write QPS (on the order of 10,000). Its memory management is similar to that of memcached. mdb supports shared memory, which lets the process of a Tair data node be restarted without losing data, so upgrades are smoother for applications and do not cause large swings in the hit rate.

rdb is another in-memory product, built on redis. Tair extracts redis's internal storage engine and supports all of redis's data structures. rdb therefore supports not only one key mapping to one value, but also one key mapping to multiple values, which can be a list/map/set/zset. Data is indexed with a tree structure keyed on the hash of the data key, which speeds up lookups. The index file is kept separate from the data file, and the index file is kept in memory as much as possible to reduce I/O overhead. Deleted space is managed with a free space pool.

ldb is built on LevelDB, which was open sourced by Google. It targets high-performance storage and can be accelerated with an embedded mdb cache. In that case, Tair maintains the data consistency between the cache and the persistent storage. It supports k/v, prefix, and other data structures.

Detailed explanation of LevelDB

The static structure of LevelDB consists of six main parts: the MemTable and Immutable MemTable in memory, and several kinds of files on disk: the Current file, the Manifest file, the log file, and the SSTable files.

ldb read and write operations

When an application writes a Key:Value record, LevelDB first appends the write to the log file and, once that succeeds, inserts the record into the Memtable. At that point the write is essentially complete. Because a write involves only one sequential disk write and one memory write, LevelDB writes are extremely fast. The log file exists mainly so that the system can recover from a crash without losing data. Without it, newly written records would initially live only in memory, and if the system crashed before they were dumped to disk, they would be lost. To avoid this, LevelDB records each operation in the log file before applying it in memory, so that even after a crash the in-memory Memtable can be rebuilt from the log file without data loss.

When the data in the Memtable reaches a memory limit, the in-memory records need to be exported to a file on disk. LevelDB creates a new log file and a new Memtable, and the original Memtable becomes the Immutable Memtable; as the name suggests, its contents can no longer be changed. Newly arriving data is recorded in the new log file and Memtable, while LevelDB's background scheduling exports the Immutable Memtable to disk as a new SSTable file. SSTables are produced by continually exporting in-memory data and running Compaction, and the SSTable files as a whole form a hierarchy: the first layer is Level 0, the second is Level 1, and so on with the level number increasing. This is also where the name LevelDB comes from.
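The write path can be sketched with a toy model. This is not LevelDB's API: the MiniLSM class is hypothetical, Python lists and dicts stand in for the log file, the Memtable, and SSTable files, the background export is an explicit flush() call, and Compaction is not modelled.

```python
class MiniLSM:
    """Toy model of the write path: log append, Memtable insert, freeze, export."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.log = []          # stands in for the current on-disk log file
        self.old_log = []      # log belonging to the Immutable Memtable
        self.memtable = {}
        self.immutable = None  # frozen Memtable waiting to be exported
        self.level0 = []       # exported tables, newest last

    def put(self, key, value):
        self.log.append((key, value))   # 1. sequential append to the log
        self.memtable[key] = value      # 2. in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            self._freeze()

    def _freeze(self):
        if self.immutable is not None:
            self.flush()                # make room for the next frozen table
        self.immutable, self.memtable = self.memtable, {}
        self.old_log, self.log = self.log, []   # new log for the new Memtable

    def flush(self):
        """Export the Immutable Memtable as a sorted table in Level 0."""
        if self.immutable is not None:
            self.level0.append(dict(sorted(self.immutable.items())))
            self.immutable = None
            self.old_log = []           # its log is no longer needed


def recover_memtable(log):
    """Rebuild a Memtable from a log after a crash (later entries win)."""
    return dict(log)


db = MiniLSM(memtable_limit=2)
db.put("b", "2")
db.put("a", "1")     # limit reached: the Memtable is frozen
db.put("c", "3")     # goes to the new log and the new Memtable
frozen = dict(db.immutable)
recovered = recover_memtable(db.log)
db.flush()
```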

When reading a key, LevelDB first looks in the Memtable in memory.

If the Memtable contains the key and its value, that value is returned.

If the key is not found in the Memtable, LevelDB next looks in the Immutable Memtable, which is also in memory.

If the key is in neither the Memtable nor the Immutable Memtable, LevelDB has to search the large number of SSTable files on disk.

Because there are many SSTables spread over multiple levels, the search starts with the files in Level 0. If the key is found, its value is returned. If not, the search moves on to the files in Level 1, and so on, until the value for the key is found in an SSTable of some level, or the highest level has been searched without success, which means the key does not exist anywhere in the system.
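The lookup order can be sketched as a single function. The lookup name and the use of dicts for the Memtable and SSTables are illustrative assumptions. The sketch also assumes the tables inside each level are listed newest first, and it does not model deletions.

```python
def lookup(key, memtable, immutable, levels):
    """Search Memtable, then Immutable Memtable, then SSTables level by level."""
    if key in memtable:
        return memtable[key]
    if immutable is not None and key in immutable:
        return immutable[key]
    for level in levels:            # Level 0 first, then Level 1, ...
        for sstable in level:       # assumed ordered newest first
            if key in sstable:
                return sstable[key]
    return None                     # not found at any level


memtable = {"a": "new"}
immutable = {"b": "2"}
levels = [
    [{"c": "3"}],                   # Level 0
    [{"a": "old", "d": "4"}],       # Level 1
]
found = {k: lookup(k, memtable, immutable, levels) for k in "abcdz"}
```

The key "a" exists both in the Memtable and in Level 1; the search order is what makes the newer value win.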

Tair VS Redis

 
Redis

Applicable
  1. Complex data structures (map, set) are needed, and the map/set holds many elements (more than 1000)
  2. Latency-sensitive services
Not applicable
  1. The amount of data exceeds 600GB (too much data; keeping it all in memory wastes too many resources)
  2. Multilingual client support is required

Tair

Applicable
  1. No tolerance for data loss
  2. Large amounts of data that cannot be held in memory
  3. Multilingual client support is required
Not applicable
  1. Complex data structures (map/set) with many elements in the map/set (more than 1000)

 

 

