What is ETL? How to learn ETL faster?

ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. Source data (for example, relational data) is extracted; "dirty" data such as incomplete, duplicate, or incorrect records is cleaned according to predefined rules; and the resulting "clean" data that meets the requirements is loaded into the data warehouse for further processing. This clean data is the cornerstone of data analysis and data mining.
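The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample data, the `orders` table, and the rule that negative amounts count as "wrong" data are all assumptions made for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: row 2 is incomplete, the second row 1 is a
# duplicate, and row 3 has an invalid (negative) amount.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
1,alice,10.50
3,carol,-4.00
4,dave,7.25
"""

def extract(text):
    """Extract: read every source row as a dict, unmodified."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete, duplicate, and invalid rows."""
    seen = set()
    clean = []
    for row in rows:
        if not all(row.values()):        # incomplete data
            continue
        if row["order_id"] in seen:      # repeated data
            continue
        amount = float(row["amount"])
        if amount < 0:                   # wrong data (assumed rule)
            continue
        seen.add(row["order_id"])
        clean.append((int(row["order_id"]), row["customer"], amount))
    return clean

def load(conn, rows):
    """Load: write the clean rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Only the rows for orders 1 and 4 pass all three cleaning rules and reach the target table.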

ETL is at the core of business intelligence (BI). It typically takes about one-third of the total time of a BI project, so its design directly affects whether the project succeeds or fails.

Enterprises commonly implement ETL in one of the following ways.

(1) Use an ETL tool (such as Pentaho Kettle or Informatica).

(2) Write SQL statements.

(3) Combine ETL tools with SQL statements.

Each of these three methods has advantages and disadvantages. The first can get an ETL project running quickly and shields developers from complex coding, which speeds up development and lowers the difficulty, but it lacks flexibility. The second, writing SQL statements, is flexible and can improve the runtime efficiency of ETL, but the coding is complex and demands a relatively high level of technical skill. The third combines the advantages of the first two and can greatly improve both the development speed and the efficiency of ETL.
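To give a feel for the second method, here is a rough sketch in which a single SQL statement does the transformation and the load. The table names `staging_customers` and `dim_customer` and the cleaning rules are hypothetical, and SQLite (driven from Python) is used only because it needs no server; the same `INSERT ... SELECT` pattern applies in other SQL databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table holding raw, uncleaned rows.
conn.executescript("""
    CREATE TABLE staging_customers (id INTEGER, name TEXT, country TEXT);
    INSERT INTO staging_customers VALUES
        (1, ' Alice ', 'us'),
        (1, ' Alice ', 'us'),
        (2, 'Bob', NULL),
        (3, 'Carol', 'de');
    CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
""")

# One statement cleans and loads: DISTINCT removes duplicates, the WHERE
# clause drops incomplete rows, TRIM and UPPER normalize the values.
TRANSFORM_AND_LOAD = """
    INSERT INTO dim_customer (id, name, country)
    SELECT DISTINCT id, TRIM(name), UPPER(country)
    FROM staging_customers
    WHERE country IS NOT NULL
"""
conn.execute(TRANSFORM_AND_LOAD)
conn.commit()

dim_rows = conn.execute(
    "SELECT id, name, country FROM dim_customer ORDER BY id"
).fetchall()
print(dim_rows)
```

The flexibility comes from being able to express any rule the database supports; the cost is that every rule has to be written and maintained by hand.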

ETL architecture

ETL is mainly used to integrate data from heterogeneous data sources. Most of the raw data from the multiple sources is loaded into the ETL process without modification, regardless of whether the source is a relational database, a non-relational database, or an external file. The integrated data is then placed in database tables or in the dimension tables of a data warehouse, where it can be transformed further (so the final data is generally stored in a database or data warehouse). The architecture of ETL is shown in the figure below.

[Figure: ETL architecture]

In the figure above, if data source 1 and data source 2 are both full-featured DBMSs (database management systems), SQL statements can be used to do part of the data cleaning. If a data source is an external file, however, SQL cannot be used to clean it; the data can only be extracted as-is and then cleaned during the transformation step. For this reason, most data cleaning in a data warehouse takes place during transformation. The cleaned data is saved to the target database for subsequent data analysis, data mining, and business intelligence.
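The contrast between the two kinds of source can be sketched as follows. Everything here is illustrative: the `users` and `users_clean` tables, the sample rows, and the validity rules are assumptions, and in-memory SQLite databases stand in for both the source DBMS and the target database.

```python
import csv
import io
import sqlite3

# Source 1: a DBMS, so part of the cleaning can happen in the extraction query.
source_db = sqlite3.connect(":memory:")
source_db.executescript("""
    CREATE TABLE users (email TEXT, age INTEGER);
    INSERT INTO users VALUES
        ('a@example.com', 31), (NULL, 40), ('b@example.com', -5);
""")
db_rows = source_db.execute(
    "SELECT email, age FROM users WHERE email IS NOT NULL AND age >= 0"
).fetchall()

# Source 2: an external file. It can only be read as-is at extraction time.
FILE_TEXT = "email,age\nc@example.com,27\n,52\nd@example.com,abc\n"
file_rows = list(csv.DictReader(io.StringIO(FILE_TEXT)))

def clean_file_rows(rows):
    """Cleaning for file data happens here, in the transformation step."""
    cleaned = []
    for row in rows:
        email = (row["email"] or "").strip()
        age = (row["age"] or "").strip()
        if email and age.isdigit():      # keep rows with an email and a valid age
            cleaned.append((email, int(age)))
    return cleaned

# Load both cleaned sets into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users_clean (email TEXT, age INTEGER)")
target.executemany(
    "INSERT INTO users_clean VALUES (?, ?)",
    db_rows + clean_file_rows(file_rows),
)
target.commit()
result = target.execute(
    "SELECT email, age FROM users_clean ORDER BY email"
).fetchall()
print(result)
```

The database rows arrive already filtered by the `WHERE` clause, while the file rows carry their bad records all the way into `clean_file_rows`.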

How to learn ETL

To help you understand ETL development and complete ETL tasks in Python, drawing on both advanced Python and the MySQL database, Dark Horse Programmer has launched a free Python+ETL course. Its roadmap covers not only ETL but also broader big data topics.

Origin blog.csdn.net/Blue92120/article/details/131450240