Introduction to data collection
ETL is the classic form of data collection, covering data extraction (Extract), transformation (Transform), and loading (Load). Data sources sit upstream of the entire big data platform, and data collection is the pipeline between the data sources and the data warehouse. During collection, data is organized according to the business scenario and data cleaning is carried out.
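The three stages can be illustrated with a minimal, self-contained sketch (the sample records and field names below are invented for illustration and are not tied to any particular tool):

```python
import csv
import io

# Extract: read raw rows from a CSV source (an in-memory file stands in
# for a real upstream system here).
raw = io.StringIO("1,alice,  ADMIN\n2,bob,user\n3,,user\n")
rows = list(csv.reader(raw))

# Transform: clean the data -- drop incomplete records, normalize case.
cleaned = [
    {"id": int(r[0]), "name": r[1], "role": r[2].strip().lower()}
    for r in rows
    if r[1]  # data cleaning: discard rows with a missing name
]

# Load: append the cleaned records to the target store (a list stands in
# for the data warehouse).
warehouse = []
warehouse.extend(cleaned)

print(len(warehouse))  # the row with the missing name was dropped
```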
In big data scenarios, data sources are complex and diverse, including business databases, log data, and multimedia data such as images and videos. Collection modes therefore also need to be varied: scheduled, real-time, incremental, full, and so on. The common data collection tools are equally diverse and can cover a wide range of business needs.
A typical data loading architecture:
Three common data collection scenarios:
- Scenario 1: Obtain data from data sources that support FTP, SFTP, HTTP and other protocols
- Scenario 2: Obtain data from business databases, and support the business systems with the data after it is collected and loaded
- Scenario 3: The data source needs to collect data in real time through message queues such as Kafka
Data collection system requirements:
- Data source management and status monitoring
- Multiple collection modes (scheduled, real-time, full, incremental, etc.) and task monitoring
- Metadata management, data backfill, and data archiving
Common data collection tools
Sqoop
Sqoop is a commonly used tool for importing and exporting data between relational databases and HDFS; it translates import and export commands into MapReduce programs. It is therefore often used to transfer data between Hadoop and traditional databases (MySQL, PostgreSQL, etc.).
Data can be imported from a relational database into a Hadoop cluster via Hadoop MapReduce, and Sqoop fully automates the process of transferring large amounts of structured or semi-structured data.
Sqoop data transmission diagram:
Sqoop Import process:
- Get the MetaData information of the source data table
- Submit MapReduce tasks based on parameters
- Each row of the table is treated as one record and imported as planned
Sqoop Export process:
- Get the MetaData information of the target data table
- Submit MapReduce tasks based on parameters
- Split each line of data in the HDFS file by specified characters and export to the database
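Both processes boil down to a single command line. The sketch below is illustrative only: the connection URL, table names, and paths are placeholders, not values from this article.

```shell
# Import: copy the MySQL table `orders` into HDFS with 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db-host:3306/shop \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

# Export: split each line of the HDFS files on ',' and write the fields
# back into a MySQL table.
sqoop export \
  --connect jdbc:mysql://db-host:3306/shop \
  --username etl --password-file /user/etl/.pw \
  --table orders_report \
  --export-dir /data/out/orders_report \
  --input-fields-terminated-by ','
```

In both cases Sqoop first reads the table's metadata through the JDBC connection, then generates and submits the MapReduce job, matching the steps listed above.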
Apache Flume
Apache Flume is a distributed, reliable, and highly available log collection system that supports multiple data sources and flexible configuration. Flume can collect, aggregate, and transport massive volumes of log data.
A Flume pipeline is built from three components: the Source (reads from the data source), the Sink (writes the data out), and the Channel (a temporary buffer between the two). Together these three components form an Agent. Flume allows users to build complex data flows, for example data passing through multiple agents before finally landing.
Flume data transmission diagram:
Schematic diagram of data transmission in Flume with multiple data sources and multiple agents:
Schematic diagram of data transmission under Flume multi-Sink multi-Agent:
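The component model can be made concrete with a minimal single-agent configuration (the agent/component names, log path, and HDFS path below are illustrative):

```properties
# Agent "a1": tail an application log, buffer events in memory, land them in HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines appended to a log file.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer between Source and Sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the events to HDFS, partitioned by date.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/logs/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Multi-agent topologies like those in the diagrams above are built by pointing one agent's Avro sink at the next agent's Avro source.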
For hands-on practice with Flume, refer to:
DataX
Official documents:
DataX is Alibaba's open-source offline synchronization tool for heterogeneous data sources. It aims to provide efficient and stable data synchronization between heterogeneous sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP. DataX turns a complex mesh of point-to-point synchronization links into a star-shaped synchronization topology, which scales well.
Comparison chart of mesh synchronization link and DataX star data synchronization link:
Schematic diagram of DataX architecture:
DataX data collection in practice
Official documents:
Download DataX from the download link on GitHub, or pull the source code and compile it yourself:
Upload the downloaded installation package to the server:
[root@hadoop ~]# cd /usr/local/src
[root@hadoop /usr/local/src]# ls |grep datax.tar.gz
datax.tar.gz
[root@hadoop /usr/local/src]#
Unzip the installation package to a suitable directory:
[root@hadoop /usr/local/src]# tar -zxvf datax.tar.gz -C /usr/local
[root@hadoop /usr/local/src]# cd ../datax/
[root@hadoop /usr/local/datax]# ls
bin conf job lib plugin script tmp
[root@hadoop /usr/local/datax]#
Execute DataX's self-check script:
[root@hadoop /usr/local/datax]# python bin/datax.py job/job.json
...
Job start time            : 2020-11-13 11:21:01
Job end time              : 2020-11-13 11:21:11
Total elapsed time        : 10s
Average throughput        : 253.91KB/s
Record write speed        : 10000rec/s
Total records read        : 100000
Total read/write failures : 0
Import CSV file data into Hive
Once the self-check passes, let's briefly demonstrate how to import data from a CSV file into Hive. We need the hdfswriter and txtfilereader plug-ins. Official documents:
- https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md
- https://github.com/alibaba/DataX/blob/master/txtfilereader/doc/txtfilereader.md
First, create a database in Hive:
0: jdbc:hive2://localhost:10000> create database db01;
No rows affected (0.315 seconds)
0: jdbc:hive2://localhost:10000> use db01;
Then create a table:
create table log_dev2(
id int,
name string,
create_time int,
creator string,
info string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as orcfile;
Once the database and table are created, the corresponding directories appear in HDFS:
[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db
Found 1 items
drwxr-xr-x - root supergroup 0 2020-11-13 11:30 /user/hive/warehouse/db01.db/log_dev2
[root@hadoop ~]#
Prepare test data:
[root@hadoop ~]# cat datax/db.csv
1,创建用户,1554099545,hdfs,创建用户 test
2,更新用户,1554099546,yarn,更新用户 test1
3,删除用户,1554099547,hdfs,删除用户 test2
4,更新用户,1554189515,yarn,更新用户 test3
5,删除用户,1554199525,hdfs,删除用户 test4
6,创建用户,1554299345,yarn,创建用户 test5
DataX defines ETL tasks through JSON configuration files. Create one with vim csv2hive.json; the content of the ETL task we want to define is as follows:
{
"setting":{
},
"job":{
"setting":{
"speed":{
"channel":2
}
},
"content":[
{
"reader":{
"name":"txtfilereader",
"parameter":{
"path":[
"/root/datax/db.csv"
],
"encoding":"UTF-8",
"column":[
{
"index":0,
"type":"long"
},
{
"index":1,
"type":"string"
},
{
"index":2,
"type":"long"
},
{
"index":3,
"type":"string"
},
{
"index":4,
"type":"string"
}
],
"fieldDelimiter":","
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"defaultFS":"hdfs://192.168.243.161:8020",
"fileType":"orc",
"path":"/user/hive/warehouse/db01.db/log_dev2",
"fileName":"log_dev2.csv",
"column":[
{
"name":"id",
"type":"int"
},
{
"name":"name",
"type":"string"
},
{
"name":"create_time",
"type":"INT"
},
{
"name":"creator",
"type":"string"
},
{
"name":"info",
"type":"string"
}
],
"writeMode":"append",
"fieldDelimiter":",",
"compress":"NONE"
}
}
}
]
}
}
- DataX uses JSON as its configuration format; the file can be local or fetched from a remote HTTP server
- The outermost element of the JSON configuration is a `job`. A `job` contains two parts, `setting` and `content`: `setting` configures the job as a whole, while `content` describes the source and destination of the data
- `setting`: global configuration such as channel count, dirty-data handling, and rate limiting. In this example only the channel count is configured
- `content`:
  - reader: configures where the data is read from
    - `name`: the plug-in name, which must match the plug-in's name in the project
    - `parameter`: input parameters for the plug-in
      - `path`: path of the source data file
      - `encoding`: data encoding
      - `fieldDelimiter`: field separator
      - `column`: position and data type of each field after splitting on the separator
  - writer: configures where the data is written to
    - `name`: the plug-in name, which must match the plug-in's name in the project
    - `parameter`: input parameters for the plug-in
      - `path`: target path
      - `fileName`: target file name prefix
      - `writeMode`: how files are written to the target directory
Execute our defined ETL tasks through DataX's Python script:
[root@hadoop ~]# python /usr/local/datax/bin/datax.py datax/csv2hive.json
...
Job start time            : 2020-11-15 11:10:20
Job end time              : 2020-11-15 11:10:32
Total elapsed time        : 12s
Average throughput        : 17B/s
Record write speed        : 0rec/s
Total records read        : 6
Total read/write failures : 0
Check whether the corresponding data file already exists in HDFS:
[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db/log_dev2
Found 1 items
-rw-r--r-- 3 root supergroup 825 2020-11-15 11:10 /user/hive/warehouse/db01.db/log_dev2/log_dev2.csv__f19a135d_6c22_4988_ae69_df39354acb1e
[root@hadoop ~]#
Go to Hive to verify that the imported data meets expectations:
0: jdbc:hive2://localhost:10000> use db01;
No rows affected (0.706 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name |
+-----------+
| log_dev2 |
+-----------+
1 row selected (0.205 seconds)
0: jdbc:hive2://localhost:10000> select * from log_dev2;
+--------------+----------------+-----------------------+-------------------+----------------+
| log_dev2.id | log_dev2.name | log_dev2.create_time | log_dev2.creator | log_dev2.info |
+--------------+----------------+-----------------------+-------------------+----------------+
| 1 | 创建用户 | 1554099545 | hdfs | 创建用户 test |
| 2 | 更新用户 | 1554099546 | yarn | 更新用户 test1 |
| 3 | 删除用户 | 1554099547 | hdfs | 删除用户 test2 |
| 4 | 更新用户 | 1554189515 | yarn | 更新用户 test3 |
| 5 | 删除用户 | 1554199525 | hdfs | 删除用户 test4 |
| 6 | 创建用户 | 1554299345 | yarn | 创建用户 test5 |
+--------------+----------------+-----------------------+-------------------+----------------+
6 rows selected (1.016 seconds)
0: jdbc:hive2://localhost:10000>
Import MySQL data into Hive
Next, let's demonstrate importing MySQL data into Hive. For this we need mysqlreader to read the data from MySQL. Official documents:
First, execute the following SQL to construct some test data:
CREATE DATABASE datax_test;
USE `datax_test`;
CREATE TABLE `dev_log` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`create_time` int(11) DEFAULT NULL,
`creator` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`info` varchar(2000) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1069 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into `dev_log`(`id`,`name`,`create_time`,`creator`,`info`) values
(1,'创建用户',1554099545,'hdfs','创建用户 test'),
(2,'更新用户',1554099546,'yarn','更新用户 test1'),
(3,'删除用户',1554099547,'hdfs','删除用户 test2'),
(4,'更新用户',1554189515,'yarn','更新用户 test3'),
(5,'删除用户',1554199525,'hdfs','删除用户 test4'),
(6,'创建用户',1554299345,'yarn','创建用户 test5');
Then create another table in Hive's db01 database:
create table log_dev(
id int,
name string,
create_time int,
creator string,
info string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;
Create the configuration file of the ETL task:
[root@hadoop ~]# vim datax/mysql2hive.json
The contents of the file are as follows:
{
"job":{
"setting":{
"speed":{
"channel":3
},
"errorLimit":{
"record":0,
"percentage":0.02
}
},
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"username":"root",
"password":"123456a.",
"column":[
"id",
"name",
"create_time",
"creator",
"info"
],
"where":"creator='${creator}' and create_time>${create_time}",
"connection":[
{
"table":[
"dev_log"
],
"jdbcUrl":[
"jdbc:mysql://192.168.1.11:3306/datax_test?serverTimezone=Asia/Shanghai"
]
}
]
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"defaultFS":"hdfs://192.168.243.161:8020",
"fileType":"text",
"path":"/user/hive/warehouse/db01.db/log_dev",
"fileName":"log_dev3.csv",
"column":[
{
"name":"id",
"type":"int"
},
{
"name":"name",
"type":"string"
},
{
"name":"create_time",
"type":"INT"
},
{
"name":"creator",
"type":"string"
},
{
"name":"info",
"type":"string"
}
],
"writeMode":"append",
"fieldDelimiter":",",
"compress":"GZIP"
}
}
}
]
}
}
- mysqlreader supports a where condition for filtering the data to be read. Its variables can be passed in when executing the datax script; this variable-substitution mechanism can be used to implement incremental synchronization
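The substitution itself is easy to model. The following sketch (plain Python, not DataX code) shows how a stored watermark could be expanded into the where template from mysql2hive.json so that each run picks up only new rows:

```python
import re

def render_where(template: str, params: dict) -> str:
    """Expand ${name} placeholders, mimicking `datax.py -p "-Dname=value"`."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(params[m.group(1)]), template)

# The where template from the reader configuration above.
template = "creator='${creator}' and create_time>${create_time}"

# Incremental sync: the watermark is the largest create_time already loaded,
# e.g. read back from a state file or metadata table after the previous run.
watermark = 1554099547
clause = render_where(template, {"creator": "yarn", "create_time": watermark})

print(clause)  # creator='yarn' and create_time>1554099547
```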
The driver bundled with mysqlreader by default targets MySQL 5.x. Since my MySQL version is 8.x, the driver package in mysqlreader needs to be replaced:
[root@hadoop ~]# cp /usr/local/src/mysql-connector-java-8.0.21.jar /usr/local/datax/plugin/reader/mysqlreader/libs/
[root@hadoop ~]# rm -rf /usr/local/datax/plugin/reader/mysqlreader/libs/mysql-connector-java-5.1.34.jar
Then execute the ETL task:
[root@hadoop ~]# python /usr/local/datax/bin/datax.py datax/mysql2hive.json -p "-Dcreator=yarn -Dcreate_time=1554099547"
...
Job start time            : 2020-11-15 11:38:14
Job end time              : 2020-11-15 11:38:25
Total elapsed time        : 11s
Average throughput        : 5B/s
Record write speed        : 0rec/s
Total records read        : 2
Total read/write failures : 0
Check whether the corresponding data file already exists in HDFS:
[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db/log_dev
Found 1 items
-rw-r--r-- 3 root supergroup 84 2020-11-15 11:38 /user/hive/warehouse/db01.db/log_dev/log_dev3.csv__d142f3ee_126e_4056_af49_b56e45dec1ef.gz
[root@hadoop ~]#
Go to Hive to verify that the imported data meets expectations:
0: jdbc:hive2://localhost:10000> select * from log_dev;
+-------------+---------------+----------------------+------------------+---------------+
| log_dev.id | log_dev.name | log_dev.create_time | log_dev.creator | log_dev.info |
+-------------+---------------+----------------------+------------------+---------------+
| 4 | 更新用户 | 1554189515 | yarn | 更新用户 test3 |
| 6 | 创建用户 | 1554299345 | yarn | 创建用户 test5 |
+-------------+---------------+----------------------+------------------+---------------+
2 rows selected (0.131 seconds)
0: jdbc:hive2://localhost:10000>
Introduction to Data Governance
Problems faced after data has been collected into the warehouse:
- Compared with traditional data warehouses, data in the big data era is more diverse, more complex, and far larger in volume
- Data inconsistencies are everywhere, data quality is hard to improve, and the data model is hard to organize
- With multiple collection tools and multiple storage methods, the data warehouse or data lake gradually degrades into a data swamp
Problems to be solved in data governance:
- Unknown data: users don't know what data they have, nor how the data relates to the business
- Uncontrolled data: without unified data standards, data cannot be integrated and unified
- Unusable data: users cannot easily obtain data, or the data they obtain is not usable
- Unconnected data: relationships between data are not captured, so the deeper value of the data cannot be realized
The goals of data governance:
- Establish unified data standards and data specifications to ensure data quality
- Develop data management procedures to control the entire life cycle of data
- Form a platform tool for users to use
Data governance:
- Data governance covers metadata management, data quality management, data lineage management, and more
- Data governance runs through data collection, data cleaning, data computation, and other stages
- Data governance is not a single technology, but a matter of process, collaboration, and management
Metadata management:
- Schema information such as the table structures of the managed data
- Statistics such as storage space, read/write records, and permission ownership
Data lineage management:
- Lineage relationships and life cycles between data
- If table B's data is collected from table A, then tables A and B have a lineage relationship
- Business attribute information and business data models for the data
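A lineage store is essentially a directed graph from source tables to the tables derived from them. The toy sketch below (not how Atlas implements it) shows how the upstream dependencies of a table can be resolved from such a graph:

```python
from collections import deque

# Edges point from a source table to the tables derived from it,
# e.g. table B is collected from table A, so B depends on A.
lineage = {
    "A": ["B"],        # B is loaded from A
    "B": ["C", "D"],   # C and D are computed from B
}

def upstream(table: str) -> set:
    """Return every ancestor table that `table` ultimately depends on."""
    result = set()
    queue = deque(src for src, dsts in lineage.items() if table in dsts)
    while queue:
        t = queue.popleft()
        if t not in result:
            result.add(t)
            queue.extend(src for src, dsts in lineage.items() if t in dsts)
    return result

print(sorted(upstream("D")))  # D comes from B, which comes from A
```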
Brief description of data governance steps:
- Unify data specifications and data definitions, and align business models with technical models
- Improve data quality and realize data lifecycle management
- Mine the value of data to help business staff use data conveniently and flexibly
Data governance and peripheral systems:
- ODS, DWD, DM and other levels of metadata are incorporated into the data governance platform for centralized management
- The metadata generated during data collection and processing is incorporated into the data governance platform, and lineage relationships are established
- Provide data management service interface, and timely notify upstream and downstream of data model changes
Apache Atlas data governance
Common data governance tools:
- Apache Atlas : Data governance open source project promoted by Hortonworks
- Metacat : Netflix's open source metadata management and data discovery components
- Navigator : Data management solution provided by Cloudera
- WhereHows : a data management solution used internally at LinkedIn and later open-sourced
Apache Atlas :
- Data classification: automatically capture, define and annotate metadata, and classify data business-oriented
- Centralized audit: capture access information of all steps, applications and data interactions
- Search and lineage: search and track the relationships between data based on classification and audit information, displayed visually
Apache Atlas architecture diagram:
- Type System: abstracts the metadata objects to be managed into entities composed of types
- Ingest/Export: tools for automatically collecting and exporting metadata; exports can be triggered as events so that consumers can respond in time
- Graph Engine: displays the relationships between data via a graph database and a graph computation engine
Metadata capture:
- Hook: hooks in each component automatically capture metadata for storage
- Entity: each integrated system triggers write events during operation
- While metadata is captured, the relationships between the data are captured as well, building lineage