Apache Hudi Frequently Asked Questions Summary

Welcome to follow the WeChat official account: ApacheHudi

1. When is Apache Hudi useful for individuals and organizations?

If you want to quickly ingest data into HDFS or cloud storage, Hudi can help. In addition, if your ETL / Hive / Spark jobs are slow or resource-intensive, Hudi can help by providing an incremental way of reading and writing data.

As an organization, Hudi can help you build an efficient data lake and solve some of the most complex low-level storage management problems, while getting data to your analysts, engineers, and scientists faster.

2. What goals is Hudi not trying to achieve?

Hudi is not designed for any OLTP use case; in those cases, you are usually better served by an existing NoSQL / RDBMS data store. Hudi cannot replace your in-memory analytical database (at least not yet!). Hudi supports near real-time ingestion on the order of a few minutes, trading off latency for efficient batch processing. If you really need sub-minute processing latency, use your favorite stream processing solution.

3. What is incremental processing? Why does Hudi keep talking about it?

Incremental processing was first introduced by Vinoth Chandar in an O'Reilly blog post, which explains most of this work. In purely technical terms, incremental processing simply refers to writing mini-batch pipelines in a streaming, incremental fashion. A typical batch job consumes all of its input every few hours and recomputes all of its output. A typical stream processing job consumes new input continuously / every few seconds and recomputes only the new / changed output. Although recomputing all output in batch mode can be simpler, it is wasteful and expensive in resources. Hudi provides the ability to write the same batch pipelines in a streaming fashion, running them every few minutes.

Although this could be called stream processing, we prefer to call it incremental processing, to distinguish it from pure stream processing pipelines built with Apache Flink, Apache Apex, or Apache Kafka Streams.
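For example, a downstream job can pull only the records that changed after a given commit instant instead of rescanning the whole table. Below is a minimal Spark (Scala) sketch of such an incremental pull; the path, instant time, and exact option key names are illustrative and vary between Hudi releases.

  // Incremental pull: read only records written after a given commit instant.
  // Assumes a Hudi dataset already exists at basePath (path illustrative).
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("hudi-incremental-pull").getOrCreate()

  val basePath = "hdfs:///data/hudi/trips"      // illustrative dataset path
  val beginInstant = "20191201000000"           // pull commits made after this instant

  val incrementalDF = spark.read
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")  // "hoodie.datasource.view.type" on older releases
    .option("hoodie.datasource.read.begin.instanttime", beginInstant)
    .load(basePath)

  // Downstream logic only has to process the changed records.
  incrementalDF.createOrReplaceTempView("trips_incremental")
  spark.sql("select count(*) from trips_incremental").show()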

4. What is the difference between the Copy-On-Write (COW) and Merge-On-Read (MOR) storage types?

Copy On Write: This storage type lets clients ingest data in a columnar file format (currently Parquet). When using the COW storage type, any new data written to a Hudi dataset is written to new Parquet files. Updating existing rows causes the entire Parquet files containing the affected rows to be rewritten. Therefore, all writes to such a dataset are limited by Parquet write performance; the larger the Parquet files, the longer it takes to ingest data.

Merge On Read: This storage type lets clients quickly ingest data in a row-based data format (such as Avro). When using the MOR storage type, any new data written to a Hudi dataset is written to new log / delta files, which are internally Avro-encoded. A compaction process (configured to run inline or asynchronously) converts the log file format into the columnar file format (Parquet).

These two file formats together provide two different views of the dataset (a read-optimized view and a real-time view): the read performance of the read-optimized view depends on the columnar Parquet files, while the read performance of the real-time view depends on the columnar and / or log files.

Updating an existing row results in either: a) a log / delta file containing the update being written against a base Parquet file produced by a previous compaction; or b) the update being written to a log / delta file when no compaction has been performed yet. Therefore, all writes to such a dataset are limited by Avro / log file write performance, which is much faster than Parquet (which requires copy-on-write). However, compared with columnar (Parquet) files, reading log / delta files is more expensive (because the records have to be merged on read).
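As a rough illustration, the storage type is chosen per dataset through a write option, and a MOR dataset can then be read through either view. A minimal Spark (Scala) sketch, assuming an existing SparkSession spark and a DataFrame df with uuid, dt, and ts columns; all names and paths are illustrative, and older Hudi releases spell some option keys differently (e.g. hoodie.datasource.write.storage.type, hoodie.datasource.view.type).

  import org.apache.spark.sql.SaveMode

  val basePath = "hdfs:///data/hudi/trips"   // illustrative path

  // Write: only the table type option differs between COW and MOR.
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  // or "COPY_ON_WRITE"
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode(SaveMode.Append)
    .save(basePath)

  // Read-optimized view: only compacted parquet base files (fast, possibly stale).
  val roDF = spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(basePath)   // older releases may need a partition glob such as basePath + "/*/*"

  // Real-time / snapshot view: base files merged with log files on the fly.
  val rtDF = spark.read.format("org.apache.hudi").load(basePath)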

Click here to find out more.

5. How do I choose a storage type for my workload?

Hudi's main goal is to provide an update capability that is orders of magnitude faster than rewriting an entire table or partition.

Choose Copy-On-Write (COW) storage if the following conditions are met:

  • You are looking for a simple way to replace existing Parquet tables, without needing real-time data.
  • Your current workflow rewrites an entire table / partition to handle updates, while each partition actually only has a few files that change.
  • You want to keep operations simpler (no compaction), with ingestion / write performance limited only by Parquet file size and the number of files affected by updates.
  • Your workload is fairly well understood and does not suddenly burst with a large number of updates or inserts to older partitions. With COW, the merge cost is paid at write time, so such sudden changes can clog up ingestion and interfere with meeting normal ingestion latency targets.

Choose Merge-On-Read (MOR) storage if the following conditions are met:

  • You want the data to be ingested as quickly as possible and to be queryable as quickly as possible.
  • Your workload may have spikes / sudden changes in pattern (for example, a bulk update of older transactions in an upstream database leading to a large number of updates to old partitions on DFS). Asynchronous compaction helps mitigate the write amplification caused by such situations, while normal ingestion keeps up with the upstream change stream (see the configuration sketch below).
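For reference, compaction behavior for a MOR dataset is controlled through write options; a minimal sketch, assuming recent Hudi option key names (they may differ by release):

  // Compaction-related write options for a MERGE_ON_READ dataset (values illustrative).
  val compactionOpts = Map(
    "hoodie.compact.inline" -> "false",                // do not run compaction as part of every write
    "hoodie.compact.inline.max.delta.commits" -> "10"  // threshold used when inline compaction is enabled
  )
  // These would be passed via .options(compactionOpts) on a MOR write such as the one in Q4.

With inline compaction turned off, compaction can instead be scheduled and run asynchronously (for example, through Hudi's DeltaStreamer in continuous mode or the hudi-cli), so ingestion is never blocked by the merge.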

Regardless of which storage type you choose, Hudi provides:

  • Snapshot isolation and atomic writes of batches of records
  • Incremental pulls
  • The ability to de-duplicate data

Click here to find out more.

6. Is Hudi an analytical database?

A typical database has a set of long-running server processes that provide read and write services. Hudi's architecture is very different: reads and writes are highly decoupled, and they can be scaled independently to address their respective scaling challenges. Therefore, it may not always feel like a database.

Nevertheless, Hudi is designed very much like a database and provides similar functionality (upserts, change capture) and semantics (transactional writes, snapshot-isolated reads).

7. How do I model the data stored in Hudi?

When writing data to Hudi, records are modeled much like in a key-value store: you specify a key field (unique within a partition / across the entire dataset), a partition field (indicating the partition the key should be placed in), and preCombine / combine logic (specifying how to handle duplicate records within a batch of written records). This model allows Hudi to enforce primary key constraints, just like on a database table. See here for an example.
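As a minimal sketch of this model in Spark (Scala): the field names (uuid, dt, ts), table name, and path below are all illustrative, and an existing SparkSession spark is assumed.

  import org.apache.spark.sql.SaveMode
  import spark.implicits._

  // Two records with the same key: on upsert, precombine keeps the one with the larger "ts".
  val records = Seq(
    ("id-1", "2019-12-01", 100L, "first write"),
    ("id-1", "2019-12-01", 200L, "later update")
  ).toDF("uuid", "dt", "ts", "note")

  records.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "notes")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "uuid")      // key field
    .option("hoodie.datasource.write.partitionpath.field", "dt")    // partition field
    .option("hoodie.datasource.write.precombine.field", "ts")       // preCombine / combine logic
    .mode(SaveMode.Append)
    .save("hdfs:///data/hudi/notes")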

When querying / reading the data, Hudi simply presents itself as a json-like hierarchical table that everyone is used to querying with Hive / Spark / Presto over Parquet / Json / Avro.
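For instance, once written, the dataset above can be queried from Spark like any other table (a rough sketch; older releases may require a partition glob in the load path):

  val notesDF = spark.read.format("org.apache.hudi").load("hdfs:///data/hudi/notes")
  notesDF.createOrReplaceTempView("notes")
  spark.sql("select uuid, dt, note from notes where dt = '2019-12-01'").show()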

8. Does Hudi support cloud storage / object stores?

In general, Hudi can provide this capability on top of any Hadoop FileSystem implementation, so datasets can be read and written on cloud stores (Amazon S3, Microsoft Azure, or Google Cloud Storage). Hudi also has specific design aspects that make building Hudi datasets on the cloud very easy, such as consistency checks for S3 and zero moves / renames of data files.
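As a rough sketch, the only change needed on the Hudi side is pointing the base path at a Hadoop FileSystem URI and supplying its credentials (bucket name and environment variable names below are placeholders, and the hadoop-aws / AWS SDK jars must be on the classpath):

  // Configure the s3a FileSystem on the existing SparkSession, then reuse the earlier write/read code.
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

  val basePath = "s3a://my-bucket/hudi/notes"   // instead of an hdfs:// path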

9. Which versions of Hive / Spark / Hadoop does Hudi support?

As of September 2019, Hudi supports Spark 2.1+, Hive 2.x, and Hadoop 2.7+ (not Hadoop 3).

10. How does Hudi actually store data in a dataset?

At a high level, Hudi is based on an MVCC design: data is written to Parquet base files and to log files that contain changes made to different versions of the base files. All files are stored under the dataset's partitioning scheme, which is very similar to how Apache Hive tables are laid out on DFS. Please refer here for more details.
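As a rough, simplified illustration (file names shortened; exact naming varies by Hudi version and storage type), a partitioned Hudi dataset on DFS looks something like this:

  /data/hudi/trips/                      <- base path of the dataset
    .hoodie/                             <- table metadata and the commit timeline
      hoodie.properties
      20191201120000.commit              <- completed commit instants (the MVCC versions)
      20191201130000.deltacommit         <- delta commits from MOR log writes
    2019/12/01/                          <- Hive-style partition directory
      .hoodie_partition_metadata
      fileId1_20191201120000.parquet     <- columnar base file
      .fileId1_20191201120000.log.1      <- avro log / delta file (MOR only)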


Source: www.cnblogs.com/apachehudi/p/12150115.html