Huawei Big Data HCIE Data Mining--ETL

What is ETL

ETL is a data pipeline that extracts distributed, heterogeneous data (Extract stage), cleans, converts, and integrates it according to business rules (Transform stage), and finally loads the processed data into a destination such as a data warehouse (Load stage).

What are the points that need to be paid attention to when extracting data?

Check data types;
ensure data integrity;
remove duplicate data;
remove dirty data;
ensure the attributes of the extracted data are consistent with the source data.
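The checks above can be sketched in plain Python. This is a minimal illustration, not a production validator; the record layout (`id`, `name`, `amount`) and the expected schema are assumptions made for the example.

```python
# Minimal sketch of extraction-time checks: type check, integrity
# check (no missing attributes), and duplicate removal.
# The schema below is an illustrative assumption.
EXPECTED_TYPES = {"id": int, "name": str, "amount": float}

def clean_extracted(rows):
    """Type-check, de-duplicate, and drop dirty (incomplete) rows."""
    seen = set()
    clean = []
    for row in rows:
        # Integrity: every expected attribute must be present and non-null.
        if any(row.get(col) is None for col in EXPECTED_TYPES):
            continue  # dirty data: skip
        # Type check: attributes must match the source schema.
        if not all(isinstance(row[col], t) for col, t in EXPECTED_TYPES.items()):
            continue
        # Duplicate removal, keyed on the primary key.
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append(row)
    return clean

rows = [
    {"id": 1, "name": "a", "amount": 9.5},
    {"id": 1, "name": "a", "amount": 9.5},   # duplicate
    {"id": 2, "name": None, "amount": 3.0},  # dirty: missing name
    {"id": 3, "name": "c", "amount": "x"},   # wrong type
]
print(clean_extracted(rows))  # only the first row survives
```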

What are the methods of data extraction?

Update notification
When new data is added to the source system or existing data is updated, the source system issues a notification. This is the simplest method of data extraction.
Full extraction
When the data source does not notify of additions or updates, full extraction can be used. Full extraction is similar to data migration or data replication: it pulls all data from the source tables or views out of the database intact and converts it into a format the ETL tool can recognize. It is relatively simple and is generally used only during system initialization; after one full extraction, incremental extraction is used for subsequent runs.
Incremental extraction
When the data source issues no notification but updated data can still be identified, incremental extraction can be used. It extracts only the data added or modified in the database tables since the last extraction, and is the more widely used approach in ETL.
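Incremental extraction by timestamp can be sketched with SQLite. The table and column names (`orders`, `last_modified`) and the date values are assumptions; the point is that only rows changed since the previous run are pulled.

```python
# Sketch of incremental extraction: pull only rows whose
# last_modified timestamp is later than the previous extraction.
# Table and column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")])

def extract_incremental(conn, last_extracted):
    """Return only the rows changed since the previous extraction run."""
    cur = conn.execute(
        "SELECT id, last_modified FROM orders WHERE last_modified > ?",
        (last_extracted,))
    return cur.fetchall()

print(extract_incremental(conn, "2024-01-03"))  # rows 2 and 3 only
```

In a real pipeline, the `last_extracted` watermark would itself be persisted after each run.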

What are the ways to load data?

Full load
The entire target table is cleared before loading.
Technically, this is simpler than incremental loading: clear the target table, then import all data from the source table. However, when the source data volume is large and the business demands real-time freshness, a large batch cannot be loaded in a short time, so full load must be combined with incremental load.
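A full load is essentially "truncate, then copy everything". A minimal SQLite sketch (table names `src` and `tgt` are assumptions; SQLite has no `TRUNCATE`, so `DELETE` stands in):

```python
# Sketch of a full load: clear the target table, then copy
# every row from the source table. Table names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, v TEXT)")
conn.execute("CREATE TABLE tgt (id INTEGER, v TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.execute("INSERT INTO tgt VALUES (9, 'stale')")  # old data to replace

def full_load(conn):
    """Clear the target table, then copy every source row into it."""
    conn.execute("DELETE FROM tgt")  # SQLite has no TRUNCATE
    conn.execute("INSERT INTO tgt SELECT * FROM src")
    conn.commit()

full_load(conn)
print(conn.execute("SELECT * FROM tgt ORDER BY id").fetchall())
```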
Incremental load
Only the changed data from the source table is applied to the target table.
The difficulty of incremental loading lies in locating the changed data: clear rules must be designed to identify changed records in the data source, apply the required logical transformations, and then update those records in the data destination.
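Once the changed rows are located, applying them is typically an upsert: update on key match, insert otherwise. A sketch using SQLite's `ON CONFLICT` clause (requires SQLite 3.24+; table name and columns are assumptions):

```python
# Sketch of an incremental load as an upsert: changed rows either
# update an existing key or insert a new one. Names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgt (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO tgt VALUES (1, 'old')")

def incremental_load(conn, changed_rows):
    """Apply only the changed rows: update on key match, insert otherwise."""
    conn.executemany(
        "INSERT INTO tgt VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET v = excluded.v",
        changed_rows)
    conn.commit()

incremental_load(conn, [(1, "new"), (2, "added")])
print(conn.execute("SELECT * FROM tgt ORDER BY id").fetchall())
```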

What are the specific forms of incremental loading?

System log analysis method
Trigger method
Timestamp method
Full table comparison method
Direct loading of incremental data, with or without conversion
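The trigger method from the list above can be illustrated with SQLite: database triggers on the business table record each change into a change log, which the ETL job later reads to locate the rows to reload. The table and trigger names here are assumptions for the sketch.

```python
# Sketch of the trigger method: AFTER INSERT/UPDATE triggers on the
# business table append to a change_log, which the ETL job consumes.
# Table, column, and trigger names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, v TEXT);
CREATE TABLE change_log (id INTEGER, op TEXT);
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN INSERT INTO change_log VALUES (NEW.id, 'I'); END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN INSERT INTO change_log VALUES (NEW.id, 'U'); END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'a')")
conn.execute("UPDATE orders SET v = 'b' WHERE id = 1")

# The ETL job reads change_log to find changed rows, then clears it.
print(conn.execute("SELECT * FROM change_log").fetchall())
```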

What are the criteria for judging whether a loading method is good or bad?

Changed data in the business system can be captured accurately and at the required frequency.
The load on the business system and the impact on existing business are kept as small as possible.
Attribute mapping can be implemented well.
Data can be quickly restored or rolled back.

Compared with ETL, what are the advantages of ELT?

Simplified architecture. No separate transformation engine is needed after extraction; data is transformed and consumed in the same place.
Reduced extraction time and performance overhead. In practice, different businesses have different requirements and need different transformations of the same data. ETL must extract, transform, and load repeatedly, while ELT extracts and loads once and then transforms many times, allowing one copy of the data to serve multiple uses and reducing time and resource overhead.
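The "load once, transform many times" idea can be sketched with SQLite standing in for the warehouse: raw data lands once, and each consumer runs its own transformation over the same table. The table name `raw_sales` and the sample figures are assumptions.

```python
# Sketch of ELT: extract + load the raw data once, then run
# multiple transformations in place for different consumers.
# Table name and sample data are illustrative assumptions.
import sqlite3

wh = sqlite3.connect(":memory:")  # stands in for the warehouse
wh.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
wh.executemany("INSERT INTO raw_sales VALUES (?, ?)",
               [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# Transformation 1: regional totals for one consumer.
by_region = wh.execute(
    "SELECT region, SUM(amount) FROM raw_sales "
    "GROUP BY region ORDER BY region").fetchall()

# Transformation 2: a grand total for another consumer,
# reusing the same loaded data with no re-extraction.
total = wh.execute("SELECT SUM(amount) FROM raw_sales").fetchone()[0]

print(by_region, total)
```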


Source: blog.csdn.net/qq_37633855/article/details/123618599