Kudu's architecture and advantages

Introduction:

Since the early days of the Hadoop ecosystem, its storage layer has been dominated by two systems, HDFS and HBase, with no major breakthrough since. For batch-processing scenarios that pursue high throughput, we choose HDFS; for scenarios requiring low latency and random reads and writes, we choose HBase. Is there a system that combines the advantages of both, supporting high throughput and low latency at the same time? Some have tried to modify the HBase kernel to build such a system, keeping the HBase data model but changing its underlying storage to pure columnar storage (currently HBase can only be considered a column-family storage engine), but this modification is difficult. Kudu emerged to solve this problem.
Kudu is Cloudera's open-source columnar storage engine with the following features:

  • Developed in C++; client APIs are available in Java and C++ (see the sketch after this list)
  • Efficiently handles OLAP-style workloads
  • Integrates well with MapReduce, Spark, and other components of the Hadoop ecosystem
  • Can be integrated with Cloudera Impala, replacing the HDFS + Parquet combination commonly used with Impala
  • Flexible consistency model
  • Good performance even in scenarios where sequential writes and random writes coexist
  • High availability: uses the Raft protocol to ensure highly reliable data storage
  • Structured data model
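
As a minimal illustration of the Java client API mentioned in the first bullet, the sketch below connects to a cluster and opens a table. It assumes the org.apache.kudu:kudu-client artifact is on the classpath; the master address and table name are placeholders.

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduTable;

public class KuduConnect {
    public static void main(String[] args) throws Exception {
        // Connect through the cluster's master (address is a placeholder).
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            // Open an existing table by name ("metrics" is hypothetical).
            KuduTable table = client.openTable("metrics");
            System.out.println("Opened table: " + table.getName());
        } finally {
            client.close();
        }
    }
}
```
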
The emergence of Kudu is expected to solve a large class of problems that are currently difficult to solve in the Hadoop ecosystem, for example time-series applications and updating the results of streaming real-time computation. The specific requirements are:

  • Query massive historical data
  • Query individual rows and require a fast response
  • In predictive models, models are updated periodically and decisions are made quickly based on historical data

Features:

  High availability. Tablet servers and the master use the Raft Consensus Algorithm to ensure node availability: as long as more than half of a tablet's replicas are available, the tablet is available for reads and writes. For example, a tablet is available if 2 out of 3 replicas, or 3 out of 5 replicas, are available. Even if the leader tablet fails, reads can be served by read-only follower tablets, and when the leader fails a new leader is elected according to the Raft mechanism.

Basic concepts:

Development language: C++

Columnar Data Store

Read Efficiency

  For analytical queries, a single column or a portion of a column can be read while the other columns are ignored

Data Compression

  Since a given column contains only one type of data, schema-based compression is orders of magnitude more efficient than compressing the mixed data types used in row-based solutions. Combined with the efficiency of reading data by column, compression lets queries complete while reading far fewer blocks from disk
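
Per-column encoding and compression can be chosen when the schema is defined. Below is a sketch using the Java client; the column name and the BIT_SHUFFLE/LZ4 choices are illustrative, not required.

```java
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Type;

// Every value in a column shares one type, so each column can carry
// its own encoding and compression (illustrative choices below).
ColumnSchema value = new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE)
        .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
        .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.LZ4)
        .build();
```
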

Table

  A table is where data is stored in Kudu. A table has a schema and a globally ordered primary key. A table is divided into segments called tablets.
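
A sketch of creating such a table with the Java client follows; the table name, columns, partition count, and replica count are all illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

// Primary key columns must come first in the schema.
List<ColumnSchema> cols = new ArrayList<>();
cols.add(new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build());
cols.add(new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build());
cols.add(new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build());
Schema schema = new Schema(cols);

// Pre-split the table into tablets and set the replication factor.
CreateTableOptions opts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("host", "ts"), 4)  // 4 tablets, illustrative
        .setNumReplicas(3);

client.createTable("metrics", schema, opts);
```
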

Tablet (segment)

  A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time one of the replicas is considered the leader tablet. Reads can be serviced by any replica, while writes require consensus among the set of tablet servers serving the tablet. 
  A table is divided into multiple tablets distributed across different tablet servers, which maximizes parallel operation. 
  Within Kudu a tablet is further divided into smaller units called RowSets. RowSets come in two kinds, MemRowSets and DiskRowSets: whenever a MemRowSet grows to 32 MB, it is flushed to disk, becoming a DiskRowSet.

Tablet Server

  A tablet server stores tablets and serves them to clients. For a given tablet, one tablet server acts as the leader and the others act as follower replicas of that tablet. Only the leader serves write requests, while either the leader or a follower can serve read requests. Leaders are elected using the Raft Consensus Algorithm. A tablet server can serve multiple tablets, and a tablet can be served by multiple tablet servers.
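
The leader/follower read path is visible in the Java client: scans go to the leader by default, but a scanner can be allowed to read from the closest replica instead. A sketch, assuming the client and table handles from earlier:

```java
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.ReplicaSelection;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

// Let reads be served by the nearest replica, leader or follower.
KuduScanner scanner = client.newScannerBuilder(table)
        .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
        .build();

while (scanner.hasMoreRows()) {
    RowResultIterator rows = scanner.nextRows();
    for (RowResult row : rows) {
        System.out.println(row.rowToString());
    }
}
```
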

Master

  The master keeps track of all tablets, tablet servers, the catalog table, and other cluster-related metadata. At a given point in time there can be only one active master (that is, the leader). If the current leader disappears, a new master is elected using the Raft Consensus Algorithm. 
  The master also coordinates client metadata operations. For example, when a new table is created, the client internally sends the request to the master. The master writes the metadata of the new table to the catalog table and coordinates the creation of the tablet on the tablet server. 
  All master data is stored in a tablet that can be replicated to all other candidate masters. 
Tablet servers send heartbeats to the master at a set interval (by default, once per second). 
The master's data is stored on disk as a file, so it must be set up when the cluster is initialized for the first time.
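
The master's view of the cluster can be queried through the client, which internally asks the master. A sketch, assuming the client handle from earlier (method names as I understand the Java API):

```java
import org.apache.kudu.client.ListTabletServersResponse;

// The master learns about live tablet servers through their heartbeats;
// the client can ask the master for that list.
ListTabletServersResponse servers = client.listTabletServers();
System.out.println(servers.getTabletServersCount() + " tablet servers: "
        + servers.getTabletServersList());
```
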

Raft Consensus Algorithm

  Kudu uses the Raft consensus algorithm as the means of ensuring fault tolerance and consistency for both regular tablets and master data. Through Raft, multiple replicas of a tablet elect a leader, which is responsible for accepting writes and replicating them to the follower replicas. Once a write has been persisted on a majority of the replicas, it is acknowledged to the client. A given group of N replicas (usually 3 or 5) can tolerate up to (N - 1)/2 faulty replicas while continuing to accept writes.
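
The majority rule can be made concrete with a one-line calculation (a worked example, not Kudu code):

```java
// A write is acknowledged once a majority of the N replicas have it,
// so up to (N - 1) / 2 replicas may fail without losing availability.
static int toleratedFailures(int n) {
    return (n - 1) / 2;
}
// toleratedFailures(3) == 1, toleratedFailures(5) == 2
```
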

Catalog Table

  The catalog table is the central location for Kudu's metadata. It stores information about tables and tablets. The catalog table cannot be read or written directly; instead, it is accessible only through the metadata operations exposed in the client API (see the sketch after the two entries below). The catalog table stores two types of metadata.

Tables

  table schemas, locations, and states

Tablets

  A list of existing tablets, which tablet servers host replicas of each tablet, each tablet's current state, and its start and end keys. 
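
Because the catalog table is reachable only through client metadata operations, catalog lookups go through calls like the following sketch (assuming the client handle from earlier; "metrics" is the hypothetical table used above):

```java
import org.apache.kudu.Schema;
import org.apache.kudu.client.ListTablesResponse;

// These calls are served from the master's catalog table; the client
// never touches the catalog tablet directly.
ListTablesResponse tables = client.getTablesList();
System.out.println("tables: " + tables.getTablesList());

if (client.tableExists("metrics")) {
    Schema schema = client.openTable("metrics").getSchema();
    System.out.println("metrics has " + schema.getColumnCount() + " columns");
}
```
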

Notice:

  1. When creating a table, all tserver nodes must be alive. 
  2. According to the Raft mechanism, the cluster still runs normally as long as no more than (replication factor - 1)/2 replicas are down; beyond that it reports an error that ip:7050 cannot be found (7050 is the RPC communication port). One thing to note: when running for the first time, make sure the cluster is in a normal state, i.e. all services are started; once running, up to (replication factor - 1)/2 nodes may go down. 
  3. Read operations can be served as long as one replica is alive.


[Figure: a Kudu cluster with three masters and multiple tablet servers, showing Raft leaders and followers] 
  The diagram above shows a Kudu cluster with three masters and multiple tablet servers, each serving multiple tablets. It illustrates how Raft consensus is used to elect leaders and followers among both the masters and the tablet servers, and how a tablet server can be the leader of some tablets and a follower of others. Leaders are shown in gold, while followers are shown in blue.

Test:
7 tablet servers
SSD disks; a manual flush wrote 10 million rows to Kudu in 5 minutes

Summary: 
  1. The number of Kudu partitions must be determined in advance. 
  2. Each tablet partition maintains a MemRowSet in memory to manage the most recently updated data. By default it is flushed once it reaches 1 GB or after 2 minutes; the flush produces a DiskRowSet on disk, and multiple DiskRowSets are merged at an appropriate time. 
  3. Unlike the LSM (Log-Structured Merge) approach used by HBase (where it is hard to apply special encodings to the data, so processing efficiency is not high), Kudu merges update records for the same row at update time rather than at query time. HBase flushes multiple update records for a row into different StoreFiles, so a read must scan multiple files, compare rowkeys, compare versions, and then apply the updates; in Kudu, a row of data exists in only one DiskRowSet, avoiding comparison and merging during reads. How does Kudu do this? For columnar data files it is very difficult to change a row in place, so in Kudu a DiskRowSet (DRS) flushed to disk actually has two forms: base data, stored in columnar format and never modified once generated, and delta files, which store changes to the base data. One base file can correspond to multiple delta files. This design means that, compared with HBase, inserting data requires an extra lookup to determine whether the primary key already exists. Kudu thus sacrifices some write performance in exchange for better read performance. 
Update and delete operations are recorded in a special data structure: in memory in the DeltaMemStore, or on disk in DeltaFiles. The DeltaMemStore is implemented as a B-tree, so it is fast and modifiable. DeltaFiles on disk are binary columnar blocks and, like the base data, are immutable. Therefore, when data is frequently deleted and modified, a large number of DeltaFiles accumulate on disk, and Kudu, borrowing from HBase, merges these files periodically. (A sketch of these write operations in the Java client appears after this summary.) 
  4. Because of the delta data, a query must consult both the base file and the delta files, which seems to converge with HBase's approach. The difference is that Kudu's delta files are not sorted by key: they are addressed by the offset of the updated row within the base file, so locating delta content requires no string comparisons, which greatly speeds up positioning. Even so, the presence of delta files has a large impact on retrieval speed, so their number must be controlled and they must be merged back into the base data in time. Since the base file is stored column by column, delta files can be merged selectively: for example, only frequently changed columns are merged, while rarely changed columns remain in the delta file for the time being, reducing unnecessary I/O overhead. 
5. Besides delta-file merging, the DRSs themselves also need compaction, in order to keep retrieval latency predictable (one of HBase's pain points: when a major compaction occurs in a partition, read and write performance suffers greatly). Kudu's compaction strategy differs substantially from HBase's: compacting Kudu's DRS data files is essentially not about reducing the number of files. In fact, a Kudu DRS is split into 32 MB units by default, and DRS compaction does not reduce the file count; instead it sorts and reorganizes the contents to reduce key-range overlap (duplication) between different DRSs, thereby reducing the number of DRSs that must be consulted during retrieval. 
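
As referenced in the summary, here is a sketch of the write operations whose effects land in the MemRowSet (inserts) and in the DeltaMemStore/delta files (updates and deletes), using the Java client and the hypothetical table from earlier:

```java
import org.apache.kudu.client.Delete;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.Update;

KuduSession session = client.newSession();

// Insert: Kudu first checks whether the primary key already exists
// (the extra lookup mentioned in point 3 above).
Insert insert = table.newInsert();
insert.getRow().addString("host", "h1");
insert.getRow().addLong("ts", 1000L);
insert.getRow().addDouble("value", 0.5);
session.apply(insert);

// Update of an already-flushed row: recorded as delta data.
Update update = table.newUpdate();
update.getRow().addString("host", "h1");
update.getRow().addLong("ts", 1000L);
update.getRow().addDouble("value", 0.7);
session.apply(update);

// Delete: also recorded as a delta until compaction removes the row.
Delete delete = table.newDelete();
delete.getRow().addString("host", "h1");
delete.getRow().addLong("ts", 1000L);
session.apply(delete);

session.close();
```
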
