Compare Kudu, Hudi and the Delta Lake

Kudu, Hudi and Delta Lake are currently the more popular storage schemes that support row-level inserts, updates, deletes and queries. This article compares the three.

Storage mechanism

Kudu
The latest data is stored in memory in a MemRowSet (row storage, sorted by primary key).
When the MemRowSet fills up (1 GB, or after the default 120 s), it is flushed to disk as a DiskRowSet (columnar storage).
Tablets periodically run compaction on their DiskRowSets: the data is re-sorted, the updates and deletes accumulated in the DeltaMemStore are applied and cleared, and the number of DiskRowSets is reduced.
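As a rough illustration of how such a primary-key-ordered Kudu table is defined from Spark, here is a sketch using the kudu-spark KuduContext; the master address, table name, schema and partitioning below are made-up examples, not part of the original article:

```scala
import scala.collection.JavaConverters._

import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kudu-create-table").getOrCreate()

// Made-up master address; in a real cluster point this at the Kudu masters.
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),      // primary key column
  StructField("value", StringType, nullable = true)))

// Inside each tablet, rows are kept sorted by this primary key: new writes go to
// the in-memory MemRowSet and are later flushed to columnar DiskRowSets.
kuduContext.createTable(
  "demo_table",
  schema,
  Seq("id"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("id").asJava, 4))
```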

Hudi
Hudi maintains a timeline of all operations performed on the dataset at different instants.
Hudi offers two storage optimizations (see the sketch after this list):
Read optimized (Copy On Write): after each commit, the latest data is compacted into columnar storage (Parquet);
Write optimized (Merge On Read): incremental data is written in a row format (Avro) and compacted into columnar storage in the background on a regular schedule.
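A sketch of how the table type is chosen at write time through the Spark datasource; the option-key constants are as in older Hudi releases (later versions renamed them), and the table name, path and columns are made up:

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-table-type").getOrCreate()
import spark.implicits._

val df = Seq(("uuid-1", 1577836800L, "2020-01-01", "a")).toDF("uuid", "ts", "dt", "value")

df.write.format("hudi")
  // COPY_ON_WRITE (read optimized): data is rewritten as parquet on every commit.
  // Use MOR_TABLE_TYPE_OPT_VAL for MERGE_ON_READ (write optimized): new data goes
  // into row-based avro logs that a background compaction folds into parquet.
  .option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL)
  .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
  .option(TABLE_NAME, "demo_hudi_table")
  .mode(SaveMode.Append)
  .save("/data/hudi/demo_hudi_table")
```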

Delta Lake
Delta Lake does not buffer data in memory; new data is written directly to new data files, and an AddFile FileAction is recorded in the commit log. When a snapshot is created or the transaction log is read for updates, the information about the new data files is loaded.
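A minimal sketch of an append through Spark (the path and columns are made up); each commit writes new parquet data files plus a JSON commit file containing AddFile actions:

```scala
import org.apache.spark.sql.SparkSession

// Assumes Delta Lake is on the classpath, e.g. spark-shell --packages io.delta:delta-core_2.12:<version>
val spark = SparkSession.builder().appName("delta-append").getOrCreate()
import spark.implicits._

val events = Seq((1L, "click"), (2L, "view")).toDF("id", "event")

// The append writes new parquet data files and commits a JSON file such as
// /data/delta/events/_delta_log/00000000000000000000.json containing AddFile actions.
events.write.format("delta").mode("append").save("/data/delta/events")
```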

Reading data

Kudu
The client sends a request to the master. The master checks that the table, schema and primary key exist, queries the catalog table for the address of the tserver hosting the corresponding tablet, and returns the tserver status and metadata. The client then connects to that tserver, uses the metadata to locate the RowSet containing the primary key, and the matching rows are returned.
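As a sketch, a read through the kudu-spark connector goes through exactly this master lookup plus tserver scan; the master address and table name are made up, and `spark` is the SparkSession already provided in spark-shell:

```scala
import org.apache.kudu.spark.kudu._

// The connector asks the master for the tablet locations of "demo_table",
// then scans the corresponding tservers.
val df = spark.read
  .options(Map(
    "kudu.master" -> "kudu-master:7051",   // made-up master address
    "kudu.table"  -> "demo_table"))        // made-up table name
  .format("kudu")
  .load()

df.filter("id = 42").show()   // predicates are pushed down to the tservers
```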

Hudi
Hudi maintains an index to check whether a record key already exists and to quickly map the key of an incoming record to the corresponding fileId. The index implementation is pluggable; the default is a Bloom filter, and HBase can also be used.
Hudi provides three query views (see the sketch after this list):
Read-optimized view: exposes only the compacted columnar data;
Incremental view: exposes only the incremental data since the last compaction/commit;
Real-time view: merges the columnar data of the read-optimized storage with the row-based data of the write-optimized storage.
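A sketch of querying the three views through the Spark datasource; the constant names are as in older Hudi releases (later versions renamed them), the base path and begin instant are made up, and an existing SparkSession `spark` with Hudi on the classpath is assumed:

```scala
import org.apache.hudi.DataSourceReadOptions._

val basePath = "/data/hudi/demo_hudi_table"   // made-up path

// Read-optimized view: only the compacted columnar (parquet) files.
val roDf = spark.read.format("hudi")
  .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
  .load(basePath + "/*/*")

// Incremental view: only the data committed after the given instant.
val incDf = spark.read.format("hudi")
  .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME_OPT_KEY, "20200101000000")   // made-up begin instant
  .load(basePath)

// Real-time / snapshot view: parquet files merged with the avro log files.
val rtDf = spark.read.format("hudi")
  .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load(basePath + "/*/*")
```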

Delta Lake
Delta Lake reads the checkpoint file of the transaction log (Parquet format) and the commit files of later versions (JSON format) to build the latest snapshot; the snapshot contains the locations of all data files of the current version.
Delta Lake is implemented on top of Spark and shares Spark's read optimizations.
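A sketch of reading the latest snapshot and time-travelling to an older version (the path and version number are made up; an existing SparkSession `spark` with Delta Lake on the classpath is assumed):

```scala
// Latest snapshot: built from the newest checkpoint plus the later JSON commits.
val latest = spark.read.format("delta").load("/data/delta/events")

// Time travel: rebuild the snapshot as of an earlier version of the log.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/data/delta/events")
```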

Updating data

Kudu
The client requests the table metadata from the master, then connects to the tserver of the corresponding tablet. If the row is still in memory (MemRowSet), the mutation is appended to that row's mutation list; if the row is already on disk (DiskRowSet), the update is written to the DeltaMemStore.
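A sketch of issuing such an update from Spark with the KuduContext (the master address, table name and columns are made up; `spark` is the spark-shell SparkSession); whether the change lands in the MemRowSet row or in the DeltaMemStore is decided inside the tserver:

```scala
import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)  // made-up master
import spark.implicits._

val updates = Seq((42L, "new value")).toDF("id", "value")

// Upsert by primary key: inserts the row if the key is new, otherwise the change
// is recorded as a mutation against the existing row (MemRowSet or DeltaMemStore).
kuduContext.upsertRows(updates, "demo_table")
```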

Hudi
When writing data with Hudi you specify PRECOMBINE_FIELD_OPT_KEY, RECORDKEY_FIELD_OPT_KEY and PARTITIONPATH_FIELD_OPT_KEY.
RECORDKEY_FIELD_OPT_KEY: the unique id of each record; multiple fields are supported;
PRECOMBINE_FIELD_OPT_KEY: used when combining records; when two records have the same RECORDKEY_FIELD_OPT_KEY, by default the record whose PRECOMBINE_FIELD_OPT_KEY value is the largest is kept;
PARTITIONPATH_FIELD_OPT_KEY: the field used to partition the stored data.
Updating data in Hudi is very similar to inserting it (the write path is almost the same); on update, records are merged according to the three fields RECORDKEY_FIELD_OPT_KEY, PRECOMBINE_FIELD_OPT_KEY and PARTITIONPATH_FIELD_OPT_KEY, as in the sketch below.
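A sketch of such an upsert (option-key constants as in older Hudi releases; the table name, path and columns are made up; `spark` is the spark-shell SparkSession); when the record key and partition match an existing row, the row with the larger ts is kept:

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Same record key and partition as an existing row; the larger "ts" wins.
val updates = Seq(("uuid-1", 1577923200L, "2020-01-01", "updated")).toDF("uuid", "ts", "dt", "value")

updates.write.format("hudi")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)   // upsert is also the default operation
  .option(RECORDKEY_FIELD_OPT_KEY, "uuid")               // unique record id
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")                // larger value wins on duplicate keys
  .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")             // partition column
  .option(TABLE_NAME, "demo_hudi_table")
  .mode(SaveMode.Append)
  .save("/data/hudi/demo_hudi_table")
```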

Delta Lake
When writing an update, Delta Lake locates the data files that contain the rows to be updated, rewrites those rows together with the other, unchanged rows of the same files into new files, and records two kinds of actions in the commit log: AddFile (the new files) and RemoveFile (the old files).
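A sketch of an update through the DeltaTable Scala API (the path, condition and column are made up; an existing SparkSession `spark` is assumed); the rewritten files show up as AddFile/RemoveFile actions in the resulting commit:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val table = DeltaTable.forPath(spark, "/data/delta/events")

// Rows matching the condition are rewritten into new files; the old files are
// recorded as RemoveFile actions and the new ones as AddFile actions in the commit.
table.update(
  condition = expr("id = 42"),
  set = Map("event" -> lit("purchase")))
```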

Other

|                             | kudu | hudi | delta lake |
| --------------------------- | ---- | ---- | ---------- |
| Uses an index               | yes | yes | no |
| Metadata location           | on the master | in the table's root folder | in the table's root folder |
| Version rollback            | not supported | has a timeline and supports version rollback | has a transaction log and supports version rollback |
| Real-time                   | new data is kept in memory, so real-time performance is high | write-optimized storage gives high real-time performance | new data can only be queried after the commit completes, so real-time performance is relatively low |
| Hadoop file system support  | not supported; Kudu manages its own storage servers via Raft | supported | supported |

Source: www.cnblogs.com/kehanc/p/12153409.html