Big Data Platform: Data Collection and Governance

Introduction to data collection

ETL is essentially the representative form of data collection, comprising data extraction (Extract), transformation (Transform) and loading (Load). The data sources are the upstream of the entire big data platform, and data collection is the pipeline between the data sources and the data warehouse. During collection, data is organized according to the business scenario and data cleaning is completed.

In big data scenarios the data sources are complex and diverse, including business databases, log data, and multimedia data such as pictures and videos. The forms of collection therefore also have to be varied: scheduled, real-time, incremental, full, and so on. Common data collection tools are equally diverse and can satisfy a wide range of business needs.

A typical data loading architecture:

Three common data collection scenarios:

  • Scenario 1: Obtain data from data sources that support protocols such as FTP, SFTP and HTTP
  • Scenario 2: Obtain data from business databases; once collected and loaded, the data in turn supports the business systems
  • Scenario 3: Data sources that must be collected in real time through message queues such as Kafka

Data collection system requirements:

  • Data source management and status monitoring
  • Multiple collection modes (scheduled, real-time, full, incremental, etc.) with task monitoring
  • Metadata management, supplementary re-collection of data, and data archiving

Common data collection tools

Sqoop

Sqoop is a commonly used tool for importing and exporting data between relational databases and HDFS; it translates import or export commands into MapReduce programs. It is therefore often used to transfer data between Hadoop and traditional databases (MySQL, PostgreSQL, etc.).

Data is imported from a relational database into the Hadoop cluster through Hadoop MapReduce, and Sqoop makes the transfer of large amounts of structured or semi-structured data completely automated.

Sqoop data transmission diagram:

Sqoop Import process:

  • Obtain the metadata of the source data table
  • Submit a MapReduce job based on the given parameters
  • Treat each row of the table as a record and import the data as planned (a sample import command is sketched below)
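
As a concrete illustration, here is a minimal sketch of a Sqoop import command; the MySQL host, database, table, credentials and target directory are placeholders of my own, not part of the original article:

# Sketch only: host, database, table, credentials and paths are placeholders
# Reads the MySQL table "orders" and writes comma-delimited files into HDFS
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/sales_db \
  --username root \
  --password 123456 \
  --table orders \
  --target-dir /user/hive/warehouse/sales_db.db/orders \
  --fields-terminated-by ',' \
  --num-mappers 4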

Sqoop Export process:

  • Obtain the metadata of the target data table
  • Submit a MapReduce job based on the given parameters
  • Split each line of the HDFS files by the specified delimiter and export the rows to the database (a sample export command is sketched below)
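
Correspondingly, a minimal sketch of a Sqoop export command; again, the host, database, table, credentials and HDFS path are placeholders:

# Sketch only: host, database, table, credentials and paths are placeholders
# Splits each line of the HDFS files by ',' and writes the rows into MySQL
sqoop export \
  --connect jdbc:mysql://mysql-host:3306/sales_db \
  --username root \
  --password 123456 \
  --table orders_report \
  --export-dir /user/hive/warehouse/sales_db.db/orders_report \
  --input-fields-terminated-by ',' \
  --num-mappers 2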

Apache Flume

Apache Flume is essentially a distributed, reliable and highly available log collection system that supports many data sources and flexible configuration. Flume can collect, aggregate and transport massive volumes of log data.

A Flume pipeline consists of three components: the Source (reads from the data source), the Channel (temporarily buffers the data) and the Sink (writes the data out). Together these three components form an Agent, and Flume allows users to build complex data flows, for example data passing through several Agents before finally landing. A minimal single-agent configuration is sketched below.
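
As a rough sketch of how the three components are wired together, the following single-agent configuration reads lines from a TCP port, buffers them in memory and prints them to the log; the agent name a1, paths and port are my own placeholders, not from the original article:

# Sketch only: agent name, paths and port are placeholders
cat > /usr/local/flume/conf/netcat-logger.conf <<'EOF'
# One agent (a1) with one source, one channel and one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: read line-oriented data from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to the log (enough for a smoke test)
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

# Start the agent
flume-ng agent --name a1 \
  --conf /usr/local/flume/conf \
  --conf-file /usr/local/flume/conf/netcat-logger.conf \
  -Dflume.root.logger=INFO,console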

Flume data transmission diagram:

Schematic diagram of data transmission in Flume with multiple data sources and multiple agents:

Schematic diagram of data transmission under Flume multi-Sink multi-Agent:

For hands-on practice with Flume, refer to the dedicated article on that topic.

DataX

Official documentation: see the DataX project repository on GitHub.

DataX is Alibaba's open-source offline synchronization tool for heterogeneous data sources. It aims to provide efficient and stable data synchronization between heterogeneous sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase and FTP. DataX turns a complex mesh of point-to-point synchronization links into a star-shaped topology centered on DataX, which gives it good scalability.

Comparison chart of mesh synchronization link and DataX star data synchronization link:

Schematic diagram of DataX architecture:


DataX data collection in practice

Download the DataX release package from the download link in the GitHub repository, or pull the source code and compile it yourself.

Upload the downloaded installation package to the server:

[root@hadoop ~]# cd /usr/local/src
[root@hadoop /usr/local/src]# ls |grep datax.tar.gz 
datax.tar.gz
[root@hadoop /usr/local/src]# 

Unzip the installation package to a suitable directory:

[root@hadoop /usr/local/src]# tar -zxvf datax.tar.gz -C /usr/local
[root@hadoop /usr/local/src]# cd ../datax/
[root@hadoop /usr/local/datax]# ls
bin  conf  job  lib  plugin  script  tmp
[root@hadoop /usr/local/datax]# 

Execute DataX's self-check script:

[root@hadoop /usr/local/datax]# python bin/datax.py job/job.json
...

Job start time                  : 2020-11-13 11:21:01
Job end time                    : 2020-11-13 11:21:11
Total job duration              :                 10s
Average throughput              :          253.91KB/s
Record write speed              :          10000rec/s
Total records read              :              100000
Total read/write failures       :                   0

Import CSV file data into Hive

After the self-check passes, let's briefly demonstrate how to import data from a CSV file into Hive. This requires the txtfilereader and hdfswriter plug-ins; see their official documentation for the full parameter reference.

First, create a database in Hive:

0: jdbc:hive2://localhost:10000> create database db01;
No rows affected (0.315 seconds)
0: jdbc:hive2://localhost:10000> use db01;

Then create a table:

create table log_dev2(
    id int,
    name string,
    create_time int,
    creator string,
    info string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as orcfile;

Once the database and table are created, the corresponding directories appear in HDFS:

[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-11-13 11:30 /user/hive/warehouse/db01.db/log_dev2
[root@hadoop ~]# 

Prepare test data:

[root@hadoop ~]# cat datax/db.csv
1,创建用户,1554099545,hdfs,创建用户 test
2,更新用户,1554099546,yarn,更新用户 test1
3,删除用户,1554099547,hdfs,删除用户 test2
4,更新用户,1554189515,yarn,更新用户 test3
5,删除用户,1554199525,hdfs,删除用户 test4
6,创建用户,1554299345,yarn,创建用户 test5

DataX defines ETL jobs through JSON configuration files. Create the job file with vim csv2hive.json; the ETL job we want to define is as follows:

{
    "setting":{

    },
    "job":{
        "setting":{
            "speed":{
                "channel":2
            }
        },
        "content":[
            {
                "reader":{
                    "name":"txtfilereader",
                    "parameter":{
                        "path":[
                            "/root/datax/db.csv"
                        ],
                        "encoding":"UTF-8",
                        "column":[
                            {
                                "index":0,
                                "type":"long"
                            },
                            {
                                "index":1,
                                "type":"string"
                            },
                            {
                                "index":2,
                                "type":"long"
                            },
                            {
                                "index":3,
                                "type":"string"
                            },
                            {
                                "index":4,
                                "type":"string"
                            }
                        ],
                        "fieldDelimiter":","
                    }
                },
                "writer":{
                    "name":"hdfswriter",
                    "parameter":{
                        "defaultFS":"hdfs://192.168.243.161:8020",
                        "fileType":"orc",
                        "path":"/user/hive/warehouse/db01.db/log_dev2",
                        "fileName":"log_dev2.csv",
                        "column":[
                            {
                                "name":"id",
                                "type":"int"
                            },
                            {
                                "name":"name",
                                "type":"string"
                            },
                            {
                                "name":"create_time",
                                "type":"INT"
                            },
                            {
                                "name":"creator",
                                "type":"string"
                            },
                            {
                                "name":"info",
                                "type":"string"
                            }
                        ],
                        "writeMode":"append",
                        "fieldDelimiter":",",
                        "compress":"NONE"
                    }
                }
            }
        ]
    }
}
  1. DataX uses JSON as its configuration format; the file can be local or served from a remote HTTP server
  2. The outermost element of the JSON configuration is a job, which contains two parts, setting and content: setting configures the job as a whole, while content describes the source and destination of the data
  3. setting: global configuration such as the channel count, dirty-data handling and rate limiting. In this example only the channel count is set (2), which means two concurrent channels are used for the transfer
  4. content
    • reader: configures where data is read from
      • name: the plug-in name, which must match the plug-in name in the project
      • parameter: input parameters for the plug-in
      • path: path of the source data file
      • encoding: data encoding
      • fieldDelimiter: field delimiter
      • column: position and data type of each field after the line is split by the delimiter
    • writer: configures where data is written to
      • name: the plug-in name, which must match the plug-in name in the project
      • parameter: input parameters for the plug-in
      • path: target path
      • fileName: prefix of the target file name
      • writeMode: how data is written to the target directory

Execute the ETL job we defined using DataX's Python launcher script:

[root@hadoop ~]# python /usr/local/datax/bin/datax.py datax/csv2hive.json
...

Job start time                  : 2020-11-15 11:10:20
Job end time                    : 2020-11-15 11:10:32
Total job duration              :                 12s
Average throughput              :               17B/s
Record write speed              :              0rec/s
Total records read              :                   6
Total read/write failures       :                   0

Check that the corresponding data file now exists in HDFS:

[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db/log_dev2
Found 1 items
-rw-r--r--   3 root supergroup        825 2020-11-15 11:10 /user/hive/warehouse/db01.db/log_dev2/log_dev2.csv__f19a135d_6c22_4988_ae69_df39354acb1e
[root@hadoop ~]# 

Go to Hive to verify that the imported data meets expectations:

0: jdbc:hive2://localhost:10000> use db01;
No rows affected (0.706 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name  |
+-----------+
| log_dev2  |
+-----------+
1 row selected (0.205 seconds)
0: jdbc:hive2://localhost:10000> select * from log_dev2;
+--------------+----------------+-----------------------+-------------------+----------------+
| log_dev2.id  | log_dev2.name  | log_dev2.create_time  | log_dev2.creator  | log_dev2.info  |
+--------------+----------------+-----------------------+-------------------+----------------+
| 1            | 创建用户         | 1554099545         | hdfs              | 创建用户 test      |
| 2            | 更新用户         | 1554099546         | yarn              | 更新用户 test1     |
| 3            | 删除用户         | 1554099547         | hdfs              | 删除用户 test2     |
| 4            | 更新用户         | 1554189515         | yarn              | 更新用户 test3     |
| 5            | 删除用户         | 1554199525         | hdfs              | 删除用户 test4     |
| 6            | 创建用户         | 1554299345         | yarn              | 创建用户 test5     |
+--------------+----------------+-----------------------+-------------------+----------------+
6 rows selected (1.016 seconds)
0: jdbc:hive2://localhost:10000> 

Import MySQL data into Hive

Next, let's demonstrate importing MySQL data into Hive. This requires the mysqlreader plug-in to read data from MySQL; see its official documentation for the full parameter reference.

First, execute the following SQL to construct some test data:

CREATE DATABASE datax_test;

USE `datax_test`;

CREATE TABLE `dev_log` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `create_time` int(11) DEFAULT NULL,
  `creator` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `info` varchar(2000) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1069 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert  into `dev_log`(`id`,`name`,`create_time`,`creator`,`info`) values 
(1,'创建用户',1554099545,'hdfs','创建用户 test'),
(2,'更新用户',1554099546,'yarn','更新用户 test1'),
(3,'删除用户',1554099547,'hdfs','删除用户 test2'),
(4,'更新用户',1554189515,'yarn','更新用户 test3'),
(5,'删除用户',1554199525,'hdfs','删除用户 test4'),
(6,'创建用户',1554299345,'yarn','创建用户 test5');

Then create another table in the Hive database db01:

create table log_dev(
    id int,
    name string,
    create_time int,
    creator string,
    info string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

Create the configuration file of the ETL task:

[root@hadoop ~]# vim datax/mysql2hive.json

The contents of the file are as follows:

{
    "job":{
        "setting":{
            "speed":{
                "channel":3
            },
            "errorLimit":{
                "record":0,
                "percentage":0.02
            }
        },
        "content":[
            {
                "reader":{
                    "name":"mysqlreader",
                    "parameter":{
                        "username":"root",
                        "password":"123456a.",
                        "column":[
                            "id",
                            "name",
                            "create_time",
                            "creator",
                            "info"
                        ],
                        "where":"creator='${creator}' and create_time>${create_time}",
                        "connection":[
                            {
                                "table":[
                                    "dev_log"
                                ],
                                "jdbcUrl":[
                                    "jdbc:mysql://192.168.1.11:3306/datax_test?serverTimezone=Asia/Shanghai"
                                ]
                            }
                        ]
                    }
                },
                "writer":{
                    "name":"hdfswriter",
                    "parameter":{
                        "defaultFS":"hdfs://192.168.243.161:8020",
                        "fileType":"text",
                        "path":"/user/hive/warehouse/db01.db/log_dev",
                        "fileName":"log_dev3.csv",
                        "column":[
                            {
                                "name":"id",
                                "type":"int"
                            },
                            {
                                "name":"name",
                                "type":"string"
                            },
                            {
                                "name":"create_time",
                                "type":"INT"
                            },
                            {
                                "name":"creator",
                                "type":"string"
                            },
                            {
                                "name":"info",
                                "type":"string"
                            }
                        ],
                        "writeMode":"append",
                        "fieldDelimiter":",",
                        "compress":"GZIP"
                    }
                }
            }
        ]
    }
}
  • mysqlreader supports a where condition to filter the data to be read. The values in the condition can be passed in as parameters when the DataX script is executed, and this variable substitution is a simple way to implement incremental synchronization (a wrapper-script sketch appears at the end of this walkthrough)

The driver bundled with mysqlreader targets MySQL 5.x. Since my MySQL version is 8.x, the driver JAR inside the mysqlreader plug-in needs to be replaced:

[root@hadoop ~]# cp /usr/local/src/mysql-connector-java-8.0.21.jar /usr/local/datax/plugin/reader/mysqlreader/libs/
[root@hadoop ~]# rm -rf /usr/local/datax/plugin/reader/mysqlreader/libs/mysql-connector-java-5.1.34.jar 

Then execute the ETL task:

[root@hadoop ~]# python /usr/local/datax/bin/datax.py datax/mysql2hive.json -p "-Dcreator=yarn -Dcreate_time=1554099547"
...

Job start time                  : 2020-11-15 11:38:14
Job end time                    : 2020-11-15 11:38:25
Total job duration              :                 11s
Average throughput              :                5B/s
Record write speed              :              0rec/s
Total records read              :                   2
Total read/write failures       :                   0

Check that the corresponding data file now exists in HDFS:

[root@hadoop ~]# hdfs dfs -ls /user/hive/warehouse/db01.db/log_dev
Found 1 items
-rw-r--r--   3 root supergroup         84 2020-11-15 11:38 /user/hive/warehouse/db01.db/log_dev/log_dev3.csv__d142f3ee_126e_4056_af49_b56e45dec1ef.gz
[root@hadoop ~]# 

Go to Hive to verify that the imported data meets expectations:

0: jdbc:hive2://localhost:10000> select * from log_dev;
+-------------+---------------+----------------------+------------------+---------------+
| log_dev.id  | log_dev.name  | log_dev.create_time  | log_dev.creator  | log_dev.info  |
+-------------+---------------+----------------------+------------------+---------------+
| 4           | 更新用户        | 1554189515          | yarn             | 更新用户 test3  |
| 6           | 创建用户        | 1554299345          | yarn             | 创建用户 test5  |
+-------------+---------------+----------------------+------------------+---------------+
2 rows selected (0.131 seconds)
0: jdbc:hive2://localhost:10000> 
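
Building on the ${creator}/${create_time} variable substitution used above, a recurring incremental sync can be driven by a small wrapper script. The sketch below is my own illustration rather than part of DataX; the watermark file and paths are assumptions:

# Sketch only: watermark file and paths are assumptions
LAST_TS_FILE=/root/datax/last_create_time
LAST_TS=$(cat "$LAST_TS_FILE" 2>/dev/null || echo 0)   # last synced create_time, default 0
NOW_TS=$(date +%s)                                     # candidate new watermark

# Pass the watermark into the job via DataX's -p variable substitution
python /usr/local/datax/bin/datax.py /root/datax/mysql2hive.json \
  -p "-Dcreator=yarn -Dcreate_time=${LAST_TS}"

# Advance the watermark only if the job succeeded
if [ $? -eq 0 ]; then
  echo "$NOW_TS" > "$LAST_TS_FILE"
fi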

Introduction to Data Governance

Problems faced once data has been collected into the warehouse:

  • Compared with traditional data warehouses, data in the big data era is more diverse, more complex and far larger in volume
  • Data inconsistencies are everywhere, data quality is hard to improve, and the data models are hard to sort out
  • With many collection tools and many storage methods, the data warehouse or data lake gradually degenerates into a data swamp

Problems to be solved in data governance:

  • Unknown data: users do not know what data they have, or how the data relates to the business
  • Uncontrollable data: without unified data standards, data cannot be integrated and unified
  • Unobtainable data: users cannot easily obtain the data, or the data they obtain is unusable
  • Unconnected data: relationships between datasets are not captured, so the deeper value of the data cannot be realized

The goals of data governance:

  • Establish unified data standards and data specifications to ensure data quality
  • Develop data management procedures to control the entire life cycle of data
  • Build platform tools that users can work with

Data governance:

  • Data governance covers metadata management, data quality management, data lineage management and more
  • Data governance runs through data collection, data cleaning, data computation and the other stages of the pipeline
  • Data governance is not a single technology; it is a matter of process, collaboration and management

Metadata management:

  • Manage schema information such as database table structures
  • Statistics such as storage space, read/write records and permission ownership

Data lineage management:

  • Lineage relationships between datasets and their life cycles
  • If table B's data is collected from table A, then B and A have a lineage relationship
  • Business attribute information of the data and business data models

Brief description of data governance steps:

  • Unify data specifications and data definitions, and bridge the business model and the technical model
  • Improve data quality and implement full data lifecycle management
  • Mine the value of the data to help business staff use it conveniently and flexibly

Data governance and peripheral systems:

  • Metadata from the ODS, DWD, DM and other layers is brought into the data governance platform for centralized management
  • Metadata generated during data collection and processing is brought into the platform and its lineage is established
  • Provide data management service interfaces and promptly notify upstream and downstream systems of data model changes

Apache Atlas data governance

Common data governance tools:

  • Apache Atlas : Data governance open source project promoted by Hortonworks
  • Metacat : Netflix's open source metadata management and data discovery components
  • Navigator : Data management solution provided by Cloudera
  • WhereHows : A data management solution used internally by LinkedIn and open source

Apache Atlas:

  • Data classification: automatically capture, define and annotate metadata, and classify data from a business perspective
  • Centralized auditing: capture access information from every step, application and data interaction
  • Search and lineage: relate data to data based on the classification and audit information, and display the relationships visually

Apache Atlas architecture diagram:

  • Type System: abstracts the metadata objects to be managed into entities described by types
  • Ingest/Export: tools for automatically collecting and exporting metadata; exports can be triggered as events so that consumers can respond in time
  • Graph Engine: shows the relationships between data through a graph database and a graph computation engine

Metadata capture:

  • Hook: hooks in each component automatically capture metadata and store it
  • Entity: each integrated system emits write events as it runs
  • While metadata is captured, relationships between datasets are also captured to build lineage (a sketch of querying this metadata over the REST API follows)
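
To make the search and lineage capabilities concrete, here is a hedged sketch of querying Atlas over its REST API; the host, credentials, entity GUID and the hive_table type are placeholders for illustration (21000 is Atlas's usual web port):

# Sketch only: host, credentials and GUID are placeholders
# Basic search: list entities of type hive_table that Atlas knows about
curl -u admin:admin \
  "http://atlas-host:21000/api/atlas/v2/search/basic?typeName=hive_table"

# Lineage: fetch the upstream/downstream lineage graph of one entity by its GUID
curl -u admin:admin \
  "http://atlas-host:21000/api/atlas/v2/lineage/<entity-guid>"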


Source: blog.51cto.com/zero01/2551050