Common data misalignment and repair techniques in Hive [Reproduced]

Reposted from: https://zhuanlan.zhihu.com/p/348698298

Preface

Data misalignment is hard to avoid in big data development. It usually originates in the upstream stages of the pipeline, and the affected Hive table data has to be repaired to ensure data quality. This article grew out of a real Hive data misalignment repair. Building on that experience, it summarizes the scenarios in which misalignment occurs, the approaches to repairing it, and repair case demos.

01 Scenarios where data misalignment occurs

First, you need to be clear about the following two concepts:

The upstream data source table holds data from different channels, such as MySQL relational databases, event-tracking logs from websites or applications, and data provided by third parties.

The downstream Hive table here mainly refers to ODS-layer tables, that is, the Hive tables into which data from these channels is extracted.

Data misalignment can occur when raw data is loaded into a Hive table. There are two main scenarios:

(1) The structure of the data source table has changed

Business adjustments or other factors outside the team's control can change the structure of the upstream data source table. Such changes come in three kinds: fields are added, deleted, or modified, in increasing order of complexity. Any of them can cause data misalignment.
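To see why a structure change shifts the data, here is a minimal Python sketch. The column names and the sample row are hypothetical and not taken from the original case. The sketch assumes a loader that pairs values with Hive columns by position, and contrasts it with a lookup by column name.

```python
# Hypothetical column lists: the Hive table still uses the old layout,
# while the upstream source has inserted "phone" in the middle.
hive_columns = ["id", "name", "city"]
source_columns = ["id", "name", "phone", "city"]

# One row exported by the upstream source in its new layout.
source_row = ["1", "Alice", "555-0100", "Berlin"]

# Positional loading: values are paired with Hive columns by index,
# so "city" receives the phone number and the real city is dropped.
by_position = dict(zip(hive_columns, source_row))

# Name-based loading: look each Hive column up in the source layout.
by_name = {col: source_row[source_columns.index(col)] for col in hive_columns}

print(by_position)
print(by_name)
```

In the positional result the `city` column silently holds a phone number, which is exactly the kind of misalignment described above.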

(2) The delimiter of the data has changed

Delimiters can also cause data misalignment. In the first case, the delimiter differs from the previous one after the data source is switched. In the second case, some fields contain the delimiter character. Both cases cause data misalignment.
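The second case can be illustrated with a short Python sketch. The sample line is hypothetical. It compares a plain split on the delimiter, which roughly corresponds to how a simple delimited-text loader treats a line, with Python's `csv` module, which honors quoting.

```python
import csv
import io

# Hypothetical CSV line: the address field itself contains a comma.
line = '1,Alice,"12 Main St, Springfield",2021-01-01'

# Naive splitting treats the comma inside the address as a separator,
# so the row gains an extra field and later values shift to the right.
naive_fields = line.split(",")

# The csv module honors the quoting and keeps the address in one field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split yields five fields for a four-column table, so every value after the address lands in the wrong column.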

02 Approaches to data repair

The previous section described the scenarios in which the problem arises. The next step is to decide how to solve it:

(1) Repair ideas when the structure of the data source table changes

When the structure of the data source table changes, whether fields are added, deleted, or modified, the core solution is the same: load the latest data into a temporary table, then backfill it into the Hive table to complete the repair.

  • Fields are added to the data source table. If downstream processing of the Hive table needs the new fields, rebuild the table according to the position of the new fields and backfill the historical data into the new table. New fields are usually appended after the existing fields of the old table; if they are not, create the new table according to the source table structure. If downstream processing does not need the new fields, create a temporary intermediate Hive table whose structure matches the data source table, then backfill the required data from it into the Hive table.

  • Fields are deleted from the data source table. If downstream processing of the Hive table needs the deleted fields, notify the downstream teams in advance so they can assess the impact and respond with a solution. If the deleted fields are not used downstream, create a temporary intermediate Hive table and backfill its data into the Hive table. The Hive table structure stays unchanged and the deleted fields are left empty.

  • Fields are modified in the data source table (a field name or type changes, or the data is misaligned and the fields no longer match). If a field name or type has changed and downstream processing of the Hive table uses that field, notify the downstream teams in advance so they can assess the impact and respond with a solution. If the data is misaligned and the fields no longer match, the situation is more complicated. Put the field names and several sample rows of both the old and the new data side by side in one Excel sheet. Then, using the complete data, name the fields of the new table after the fields of the old table, keeping the names of fields common to both tables consistent wherever possible. If some fields of the old table are not fully represented in the new table, notify the downstream teams in advance so they can assess the impact and respond with a solution. Fields that are not used downstream are left empty when the temporary intermediate Hive table is backfilled into the Hive table.
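The compare-then-backfill steps above can be sketched in Python. This is an illustration under assumptions, not the original author's script: the table names, column names, and both helper functions are hypothetical, and the generated statement assumes an unpartitioned target table (a partitioned table would also need a `PARTITION` clause).

```python
def compare_columns(old_columns, new_columns):
    """Return (removed, added) field names between the old and new layouts."""
    removed = [c for c in old_columns if c not in new_columns]
    added = [c for c in new_columns if c not in old_columns]
    return removed, added


def build_backfill_sql(target_table, target_columns, staging_table, staging_columns):
    """Build an INSERT OVERWRITE that fills the target from staging by column name."""
    staging = set(staging_columns)
    select_items = []
    for col in target_columns:
        if col in staging:
            select_items.append(f"`{col}`")
        else:
            # Field no longer exists upstream: keep the column, leave it empty.
            select_items.append(f"NULL AS `{col}`")
    select_list = ",\n    ".join(select_items)
    return (
        f"INSERT OVERWRITE TABLE {target_table}\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {staging_table}"
    )


# Hypothetical layouts: "age" was deleted upstream and "phone" was added.
hive_cols = ["id", "name", "age", "city"]
staging_cols = ["id", "name", "phone", "city"]

removed, added = compare_columns(hive_cols, staging_cols)
sql = build_backfill_sql("ods.user_info", hive_cols,
                         "tmp.user_info_staging", staging_cols)
print("removed:", removed, "added:", added)
print(sql)
```

Selecting by name rather than by position keeps the Hive table structure unchanged, fills deleted fields with `NULL`, and ignores new upstream fields that downstream processing does not need.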

(2) Repair ideas when the data delimiter changes

If the delimiter is inconsistent after the data source is switched, convert the delimiter first and then upload the data, and agree on the format requirements with the upstream provider.
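Converting the delimiter before upload could look like the following Python sketch. The function name `convert_delimiter` and the sample data are hypothetical. The sketch uses in-memory streams, so real use would pass opened files instead.

```python
import csv
import io


def convert_delimiter(src, dst, old_delim, new_delim):
    """Copy delimited rows from src to dst, replacing old_delim with new_delim."""
    reader = csv.reader(src, delimiter=old_delim)
    writer = csv.writer(dst, delimiter=new_delim, lineterminator="\n")
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Hypothetical input: the new source uses "|" while the Hive table expects ",".
src = io.StringIO("1|Alice|Berlin\n2|Bob|Paris\n")
dst = io.StringIO()
count = convert_delimiter(src, dst, "|", ",")
print(count)
print(dst.getvalue())
```

Note that `csv.writer` wraps a field in quotes if it contains the new delimiter. A table defined with plain `FIELDS TERMINATED BY` generally does not interpret such quotes, so the target delimiter should be one that never appears in the data.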

Some fields contain the delimiter itself, for example a ',' inside a field of a CSV file. This situation is more complicated, and the format itself is unsound. If the comma remains the delimiter, data misalignment is inevitable. Agree with the upstream data provider to switch to a less common character such as '\t', or to another special symbol that will not appear in the field content.
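Before agreeing on a new delimiter with the upstream provider, it can help to check the candidates against sample data. The Python sketch below is one possible check. The function name `safe_delimiters`, the candidate list, and the sample rows are all assumptions made for illustration.

```python
def safe_delimiters(rows, candidates=("\t", "\x01", "|")):
    """Return the candidate delimiters that never occur inside any field value."""
    return [
        delim for delim in candidates
        if not any(delim in field for row in rows for field in row)
    ]


# Hypothetical sample rows, already split into fields.
rows = [
    ["1", "Alice", "12 Main St, Springfield"],
    ["2", "Bob", "a|b testing"],
]

print(safe_delimiters(rows, candidates=(",", "|", "\t")))
```

Here the comma and the pipe both occur inside field values, so only the tab survives. A sample can only rule candidates out. It cannot prove that a character will never appear in future data, so the agreement with upstream still matters.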

Summary

Hive data misalignment is hard to avoid in big data development. Its causes include insufficient research into the data and poor communication between upstream and downstream teams. When the problem occurs, stay calm, sort out your approach, and then carry out the repair. The ultimate goal is to ensure data quality without affecting downstream processing.

Origin blog.csdn.net/qq_39900031/article/details/115337326