Zhuangshixue Data Technology 04: ETL

Hi everyone, Zhuangshi is here to meet you again this Saturday morning~


Following the previous installment, "Zhuangshixue Data Technology 03: Data Access", we now enter the data development stage. And once we start talking about data development, there is one term we cannot get around: ETL.

So what is ETL? Why do we need it? What ETL tools are on the market? Today, Zhuangshi will walk you through ETL.

01 What is ETL

As mentioned in the previous installment, after data is loaded into the data warehouse, it must go through a series of operations before the business side can use it. Put simply, this series of operations integrates the data according to unified rules, and we call the collection of these rules the data warehouse model.

If the data warehouse model is a building and the data its bricks and tiles, then ETL is the construction process: it connects the two ends, data source and data warehouse.

In a data warehouse project, the hardest parts are user requirements analysis and model design, while ETL rule design and implementation carry the largest workload, typically 60% to 80% of the whole project. This is a general consensus drawn from practice.

ETL (Extract-Transform-Load) refers to the process of extracting data from the source, transforming it, and loading it into the destination. We extract the required data from the data source, clean it, and finally load it into the data warehouse according to the pre-defined warehouse model. Data synchronization jobs come in many varieties; common ones include mysql2hive, hive2hive, hive2mysql, etc.
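As a concrete (if toy) illustration of the three steps, here is a minimal Python sketch that extracts rows from a source database, cleans them, and loads them into a warehouse table. It uses sqlite3 in place of real MySQL/Hive endpoints, and the table names `orders` / `dw_orders` are made up for the example:

```python
import sqlite3

# sqlite3 stands in for the real endpoints; in practice the source might be
# MySQL and the target Hive (i.e. a mysql2hive job).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount TEXT, ts TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "10.5", "2020-01-01"), (2, None, "2020-01-02")])
warehouse.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL, ts TEXT)")

# Extract: pull the required rows out of the source
rows = source.execute("SELECT id, amount, ts FROM orders").fetchall()

# Transform: clean nulls and cast amount to a numeric type
cleaned = [(i, float(a) if a is not None else 0.0, t) for i, a, t in rows]

# Load: write into the table defined by the warehouse model
warehouse.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()
```

Real jobs add incremental extraction, error handling, and scheduling, but the extract/transform/load skeleton is the same.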

There are three common ways to implement ETL: with ETL tools, with SQL, or with a combination of the two. The first two each have pros and cons. Tools let you stand up an ETL project quickly: they hide complex coding work, increase speed, and lower difficulty, but they lack flexibility. The SQL approach is flexible and runs ETL jobs efficiently, but the coding is complex and demands more technical skill. The third approach combines the strengths of both, greatly improving ETL development speed and efficiency.
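To give a feel for the SQL approach, a hive2hive-style job often reduces to a single declarative `INSERT ... SELECT` statement run inside the engine. A minimal sketch, again using sqlite3 in place of Hive, with hypothetical `ods_orders` / `dw_paid_orders` tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ods_orders (id INTEGER, status TEXT, amount REAL)")
db.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)",
               [(1, "paid", 10.0), (2, "cancelled", 5.0), (3, "paid", 7.5)])
db.execute("CREATE TABLE dw_paid_orders (id INTEGER, amount REAL)")

# The whole transform is one SQL statement; no row-by-row code needed.
db.execute("""
    INSERT INTO dw_paid_orders (id, amount)
    SELECT id, amount FROM ods_orders WHERE status = 'paid'
""")
```

The flexibility of this style comes from SQL itself: filters, joins, and aggregations are all expressed in one place, at the cost of having to hand-maintain the statements.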

Common ETL tools include DataStage, Informatica, and Kettle. DataStage and Informatica are expensive commercial tools, while Kettle is an open-source tool written in Java. All of them support ETL development through drag-and-drop configuration.

02 Why do we need ETL

So why do we need ETL? The main reasons are as follows:

When data resides on different physical machines, handling it purely with SQL wastes computing resources.

When data comes from different databases or files, it must first be normalized into a unified format before processing, which is tedious to implement in hand-written code.

Processing massive data directly in the source database consumes a large share of its resources, which can exhaust them and degrade database performance.

Simply put, running ETL over heterogeneous data sources saves computing and storage resources and keeps the code much simpler: it saves money, effort, and worry.
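To make the "unified format" point concrete, here is a small hypothetical example: two sources describe the same kind of record with different field names and date formats, and a `normalize` step maps both onto one schema (all names here are invented for illustration):

```python
from datetime import datetime

# Hypothetical records from two sources with different schemas
mysql_row = {"user_id": 1, "signup": "2020/01/02"}
csv_row = {"uid": "2", "signup_date": "2020-01-03"}

def normalize(row):
    """Map either source's schema onto one unified format."""
    uid = int(row.get("user_id", row.get("uid")))
    raw = row.get("signup", row.get("signup_date"))
    # Try each known date format until one parses
    for fmt in ("%Y/%m/%d", "%Y-%m-%d"):
        try:
            ts = datetime.strptime(raw, fmt).date().isoformat()
            break
        except ValueError:
            continue
    return {"user_id": uid, "signup_date": ts}

unified = [normalize(r) for r in (mysql_row, csv_row)]
```

With two sources this is manageable by hand; with dozens of sources and formats, this mapping logic is exactly the "troublesome code" that ETL tools exist to absorb.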

03 ETL Tools

3.1 Criteria for selecting an ETL tool

1. Degree of support for your platform

2. Whether extraction and loading performance is high, with low impact on, and low intrusion into, the business system

3. Degree of support for data sources

4. Whether it offers good integration and openness

5. Whether its data transformation and processing capabilities are strong

6. Whether it provides management and scheduling functions

3.2 Recommended mainstream ETL tools

Commercial

Enterprise in-house platforms

Alibaba Yushanfang

Take Alibaba Yushanfang as an example: its entire data development process flows through "model design", "data development", "release", and "operations and maintenance".


Ant Group's ETL data platform follows a similar design.


NetEase Mammoth Big Data Platform

In its data development module, the NetEase Mammoth big data platform provides agile development interfaces for many task types, such as database transfer, SQL, Spark, OLAP Cube, MapReduce, and scripts. Task developers can create tasks by dragging and dropping, which makes data integration, data ETL, data analysis, and other data science work convenient. Taking database transfer as an example, the user only needs to drag the "database transfer" component onto the canvas, double-click it, and fill in a form via drop-down boxes and manual input to quickly complete a data transfer task.

ByteDance's data platform

Users can host ETL jobs in a unified way and write declarative ETL definitions. (Source: https://myslide.cn/slides/4103)

As we can see, today's ETL tools come in many flavors, including SQL-like modes, drag-and-drop modes, and packaged configuration modes.

Of course, data development products cover far more than this; they also involve UDFs, DSNs, ETL parameter tuning (too many small files & data skew), transforms, and so on.
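Of the tuning problems just mentioned, data skew is commonly mitigated by "salting": a hot key is split across several buckets, partially aggregated, then merged. A toy pure-Python illustration of the two-stage idea (in practice this would be expressed in Hive or Spark SQL; the key names and bucket count here are made up):

```python
import random
from collections import defaultdict

random.seed(0)
SALTS = 4  # number of buckets to spread a hot key across

# A skewed dataset: almost every row carries the same key
rows = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 3

# Stage 1: aggregate on (key, salt) so the hot key is split into SALTS groups
partial = defaultdict(int)
for key, value in rows:
    partial[(key, random.randrange(SALTS))] += value

# Stage 2: strip the salt and merge the partial aggregates
final = defaultdict(int)
for (key, _salt), value in partial.items():
    final[key] += value
```

The trade-off is an extra aggregation pass in exchange for no single reducer receiving all the hot-key rows.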

In addition, there are ETL scheduling, SLA links, and other topics that touch on data scheduling and the data production process after development. If you are interested in data development products, feel free to reach out offline for a chat~

Alright, that's it for today's "Zhuangshixue Data Technology 04: ETL", thank you for reading~


A Data Person's Private Place is a big family that helps data people grow, helping partners interested in data clarify their learning direction and sharpen their skills. Follow me and explore the wonderful mysteries of data.

1. Reply "Data Products" to get <Big-Tech Data Product Interview Questions>

2. Reply "Data Center" to get <Big-Tech Data Middle Platform Materials>

3. Reply "Business Analysis" to get <Big-Tech Business Analysis Interview Questions>

4. Reply "Make Friends" to join the exchange group and get to know more data partners.

Origin blog.csdn.net/weixin_49880348/article/details/110286445