3.2.3 Sqoop data migration tool: importing data (MySQL to HDFS/Hive), exporting data, incremental import, Sqoop jobs, common commands and parameters

Table of Contents

Data Migration Tool - Sqoop

Part One Sqoop Overview

Part Two Installation and Configuration

Part Three Application Cases

Section 1 Importing Data (import)

MySQL to HDFS

MySQL to Hive

Section 2 Exporting Data (export)

Section 3 Incremental Data Import

Change Data Capture (CDC)

The Append Method

Section 4 Executing Jobs

Part Four Common Commands and Parameters

Section 1 Common Commands

Section 2 Common Parameters

Common parameters - database connection

Common parameters - import

Common parameters - export

Common parameters - hive

import parameters

export parameters


 

Data Migration Tool - Sqoop

Part One Sqoop Overview

Sqoop is an open-source tool used mainly to transfer data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, etc.). It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and it can also export data from HDFS back into a relational database.

The Sqoop project started in 2009. It first existed as a third-party module of Hadoop; later, to let users deploy it quickly and to let developers iterate faster, Sqoop became an independent Apache project.

 

Part Two Installation and Configuration

Sqoop official website: http://sqoop.apache.org/
Sqoop download address: http://www.apache.org/dyn/closer.lua/sqoop/

 

1. Download, upload and unzip

Upload the downloaded installation package sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the virtual machine, then unzip it:

tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ ../servers/sqoop-1.4.7/

2. Add environment variables and make them take effect

vi /etc/profile

# Add the following content
export SQOOP_HOME=/opt/lagou/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile

3. Create and modify configuration files

# Configuration file location $SQOOP_HOME/conf; the configuration file to be modified is sqoop-env.sh
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh

# Add the following content at the end of the file
export HADOOP_COMMON_HOME=/opt/lagou/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/lagou/servers/hadoop-2.9.2
export HIVE_HOME=/opt/lagou/servers/hive-2.3.7

4. Copy the JDBC driver

# Copy the jdbc driver to the lib directory of sqoop (note: a soft link also works)
ln -s /opt/lagou/servers/hive-2.3.7/lib/mysql-connector-java-5.1.46.jar /opt/lagou/servers/sqoop-1.4.7/lib/

5. Copy jars. Copy
hive-common-2.3.7.jar from $HIVE_HOME/lib to the $SQOOP_HOME/lib directory. If it is not copied, importing data from MySQL to Hive fails with: ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

# Either a hard copy or a soft link works; choose one. The following is a hard copy
cp $HIVE_HOME/lib/hive-common-2.3.7.jar $SQOOP_HOME/lib/

# Establish a soft link
ln -s /opt/lagou/servers/hive-2.3.7/lib/hive-common-2.3.7.jar /opt/lagou/servers/sqoop-1.4.7/lib/hive-common-2.3.7.jar

Copy $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar to the $SQOOP_HOME/lib/ directory; otherwise, when creating a sqoop job, it will report: java.lang.NoClassDefFoundError: org/json/JSONObject

cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib/

6. Installation verification

[root@linux123 ~]# sqoop version

... (warnings omitted) ...
20/06/19 10:37:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

 

# Test whether Sqoop can successfully connect to the database
[root@linux123 ~]# sqoop list-databases --connect jdbc:mysql://linux123:3306/?useSSL=false --username hive --password 12345678

 

... (warnings omitted) ...
information_schema
hivemetadata
mysql
performance_schema
sys


Part Three Application Cases

In Sqoop:

  • Import means transferring data from a relational database into the big data cluster (HDFS, Hive, HBase); it uses the import keyword.
  • Export means transferring data from the big data cluster into a relational database; it uses the export keyword.

Test data script

-- Used to generate test data in MySQL
CREATE DATABASE sqoop;

use sqoop;

CREATE TABLE sqoop.goodtbl(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);


DROP FUNCTION IF EXISTS `rand_string`;
DROP PROCEDURE IF EXISTS `batchInsertTestData`;


-- Change the default statement delimiter from ; to //
DELIMITER //

CREATE FUNCTION `rand_string` (n INT) RETURNS VARCHAR(255)
CHARSET 'utf8'
BEGIN
   DECLARE char_str VARCHAR(200) DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
   DECLARE return_str VARCHAR(255) DEFAULT '';
   DECLARE i INT DEFAULT 0;
   WHILE i < n DO
        SET return_str = CONCAT(return_str, SUBSTRING(char_str, FLOOR(1 + RAND()*36), 1));
        SET i = i+1;
   END WHILE;
   RETURN return_str;
END
//


-- First parameter: the starting serial number; second parameter: the number of records to insert
CREATE PROCEDURE `batchInsertTestData` (m INT, n INT)
BEGIN
   DECLARE i INT DEFAULT 0;
   WHILE i < n DO
      INSERT INTO goodtbl (gname, serialNumber, price, stock_number, create_time)
      VALUES (rand_string(6), i+m, ROUND(RAND()*100), FLOOR(RAND()*100), NOW());
      SET i = i+1;
   END WHILE;
END
//

DELIMITER ;

call batchInsertTestData(1, 100);
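
A quick row count from the shell confirms that the generator worked (a minimal check; the MySQL host and credentials follow the examples in this post):

# Should report 100 rows in sqoop.goodtbl
mysql -h linux123 -u hive -p12345678 -e "SELECT COUNT(*) FROM sqoop.goodtbl;"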

 

Section 1 Importing Data (import)

MySQL to HDFS

1. Import all data

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

Remarks:

  • target-dir: the HDFS path to import the data into;
  • delete-target-dir: if the target directory already exists on HDFS, running the import again reports an error. Use --delete-target-dir to delete the directory first, or use the --append parameter to append the data instead;
  • num-mappers: how many Map Tasks to start; 4 Map Tasks are started by default; it can also be written as -m 1;
  • fields-terminated-by: the field separator used in the HDFS files;
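
To verify the import, the target directory can be inspected directly (a sketch; with a single map task the output file is normally named part-m-00000):

hdfs dfs -ls /root/lagou
hdfs dfs -cat /root/lagou/part-m-00000 | head -5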

 

2. Import query data

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query 'select gname, serialNumber, price, stock_number,
create_time from goodtbl where price>88 and $CONDITIONS order by price;'

Remarks:

  • The where clause of the query statement must contain '$CONDITIONS'.
  • If the query is wrapped in double quotes, an escape character must be added before $CONDITIONS (i.e. \$CONDITIONS) so the shell does not interpret it as a shell variable (see the sketch below).
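
For example, the same query written with double quotes needs the escape (a sketch based on the command above):

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query "select gname, serialNumber, price, stock_number, create_time from goodtbl where price>88 and \$CONDITIONS"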

3. Import specified columns

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns gname,serialNumber,price \
--table goodtbl

Remark: when specifying multiple columns with --columns, separate them with commas and do not add spaces

4. Import filtered data (using the where keyword)

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--table goodtbl \
--where "price>=68"

5. Start multiple Map Tasks to import data. First add more data to goodtbl, for example: call batchInsertTestData(101, 1000000); (the procedure takes a starting serial number and a row count)

# Add a primary key to the goodtbl table
alter table goodtbl add primary key(serialNumber);

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--append \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by serialNumber

sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--delete-target-dir \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by gname

Note:
When multiple Map Tasks are used to import data, Sqoop must partition the data across the tasks.

  • If the MySQL table has a primary key, just specify the number of Map Tasks;
  • If the MySQL table has no primary key, use --split-by to specify the partition field;
  • If the partition field is of a character type, add -Dorg.apache.sqoop.splitter.allow_text_splitter=true to the sqoop command, i.e.:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux123:3306/sqoop \
... ...

  • The '$CONDITIONS' in the where clause of a query is also used for data partitioning, even if there is only one Map Task

 

MySQL to Hive

Create a table in hive:

CREATE TABLE mydb.goodtbl(
gname string,
serialNumber int,
price int,
stock_number int,
create_time date);

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table mydb.goodtbl \
-m 1

Parameter Description:

  • hive-import: required; specifies that the data is imported into Hive
  • hive-database: the Hive database name (default: default)
  • hive-table: the Hive table name
  • fields-terminated-by: the Hive field separator
  • hive-overwrite: overwrite existing data
  • create-hive-table: creates the Hive table automatically, but the generated table may have problems. This parameter is not recommended; creating the table in advance is preferred

 

Section 2 Exporting Data (export)

Data entering the big data platform is an import: import

Data leaving the big data platform is an export: export
Note: the MySQL table must be created in advance.

-- Create the table in advance
CREATE TABLE sqoop.goodtbl2(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);

Export Hive mydb.goodtbl → MySQL sqoop.goodtbl2:

sqoop export \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl2 \
-m 4 \
--export-dir /user/hive/warehouse/mydb.db/goodtbl \
--input-fields-terminated-by "\t" 
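
After the export completes, the row count on the MySQL side can be compared with the Hive table (a minimal check using the credentials from this post):

mysql -h linux123 -u hive -p12345678 -e "SELECT COUNT(*) FROM sqoop.goodtbl2;"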

 

Section 3 Incremental Data Import

Change Data Capture (CDC)

All of the previous imports were full imports. If the data volume is small, extracting the complete source data is fine; if the source data volume is large, only the changed data should be extracted. This extraction mode is called Change Data Capture, or CDC for short.

 

CDC is roughly divided into two types: intrusive and non-intrusive. Intrusive means that the CDC operation has a performance impact on the source system: as long as the CDC operation performs SQL against the source database in any way, it is considered intrusive.

The four commonly used CDC methods are (the first three are intrusive):

CDC based on timestamp. The extraction process can determine which data is incremental based on certain attribute columns. The most common attribute columns are as follows:

  • Timestamp: ideally two columns, an insert timestamp recording when the row was created and an update timestamp recording when it was last updated;
  • Sequence: most databases provide an auto-increment feature; if a column is defined as auto-increment, newly inserted rows are easy to identify by that column;

This method is the simplest and most commonly used, but it has the following disadvantages:

  • It cannot capture delete operations
  • It cannot recognize multiple updates to the same row
  • It is not real-time

 

Trigger-based CDC. When SQL statements such as INSERT, UPDATE, and DELETE are executed, triggers in the database are activated. The triggers can capture the changed data and save it to an intermediate temporary table, from which the changed data is then retrieved. In most cases, adding triggers to an operational database is not allowed, and this method degrades system performance, so it is not adopted;

CDC based on snapshots. The data changes can be obtained by comparing the source table and the snapshot table. The snapshot-based CDC can detect inserted, updated, and deleted data, which is an advantage over the time stamp-based CDC scheme. The disadvantage is that it requires a lot of storage space to save the snapshot.

Log-based CDC. The most complex and non-intrusive CDC method is the log-based approach. The database records every insert, update, and delete operation in its log; by parsing the log files, the relevant change information can be obtained. Each relational database has its own log format and there is no universal tool; Alibaba's Canal can parse MySQL log files.

There are two ways to import data incrementally:

  • Incremental import based on an incremental column (the Append method)
  • Incremental import based on a time column (the LastModified method; a brief sketch is given at the end of this section)

The Append Method

1. Prepare initial data

-- Delete all data from the MySQL table
truncate table sqoop.goodtbl;

-- Delete all data from the Hive table
truncate table mydb.goodtbl;

-- Insert 100 rows into the MySQL table
call batchInsertTestData(1, 100);

2. Import data into Hive

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1

Parameter Description:

  • check-column specifies the column(s) (more than one can be given) used during an incremental import to decide which rows are new, similar to an auto-increment field or a timestamp in a relational database. These columns cannot be of a character type such as char or varchar.
  • last-value specifies the maximum value the check column reached in the previous import; only rows with a greater value are imported.

3. Check whether the Hive table contains data, and how many rows there are (see the check below)
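
One way to check from the shell (a sketch; at this point the count should be 100):

hive -e "SELECT COUNT(*) FROM mydb.goodtbl;"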

4. Add 1,000 more rows to MySQL, with serial numbers starting from 200

call batchInsertTestData(200, 1000);

5. Run the incremental import again to bring the new MySQL data into Hive; this time, set last-value to 100

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1

6. Check again whether the Hive table contains data, and how many rows there are
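
For contrast, here is a minimal sketch of the LastModified method mentioned at the start of this section. The target directory and the last-value timestamp are illustrative assumptions; the check column must be a date/timestamp column such as create_time, and in Sqoop 1.4.x this mode cannot be combined with --hive-import, so the sketch writes to an HDFS directory:

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--target-dir /root/lagou/sqoop/lastmodified \
--incremental lastmodified \
--check-column create_time \
--last-value "2020-06-19 00:00:00" \
-m 1

On later runs, rows whose create_time is newer than the saved last-value are picked up; since the target directory then already exists, either --append or --merge-key <column> must be added so Sqoop can write into it.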

 

Section 4 Executing Jobs

There are two ways to run incremental data imports:

  1. Manually set the last-value each time and trigger the import manually
  2. Use a Sqoop job: give it an initial last-value and schedule it as a daily timed task

Method 2 is obviously easier.

1. Create a password file

echo -n "12345678" > sqoopPWD.pwd
hdfs dfs -mkdir -p /sqoop/pwd
hdfs dfs -put sqoopPWD.pwd /sqoop/pwd
hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd

# The following can then be added to the sqoop job:
--password-file /sqoop/pwd/sqoopPWD.pwd

2. Create sqoop job

# Create a sqoop job
sqoop job --create myjob1 -- import \
--connect jdbc:mysql://linux123:3306/sqoop?useSSL=false \
--username hive \
--password-file /sqoop/pwd/sqoopPWD.pwd \
--table goodtbl \
--incremental append \
--hive-import \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1

# List the jobs that have been created
sqoop job --list

# Show a job's detailed runtime parameters
sqoop job --show myjob1

# Execute the job
sqoop job --exec myjob1

# Delete the job
sqoop job --delete myjob1

3. Execute the job

sqoop job --exec myjob1

4. View data

Implementation principle:
After the job runs, the maximum value of the check-column seen in that run is recorded in the job's metadata; the next time the job is launched, that value is used as the last-value.

By default metadata is saved in ~/.sqoop/

The metastore.db.script file in that directory records the updates to last-value:

cat metastore.db.script |grep incremental.last.value

 

Part Four Common Commands and Parameters

Section 1 Common Commands
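
The full list of available Sqoop tools, together with per-tool parameter help, can be printed from the command line:

# List all available Sqoop tools (import, export, job, list-databases, ...)
sqoop help

# Show the parameters of a specific tool, e.g. import
sqoop help import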

 

Section 2 Common Parameters

The so-called common parameters are the parameters supported by most commands.

Common parameters - database connection
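
For reference, a sketch of the connection-related parameters as they are used throughout the examples above (--password-file and --driver are optional alternatives):

# Connection parameters common to most Sqoop tools:
#   --connect        JDBC connection string of the database
#   --username       database user name
#   --password       password given on the command line
#   --password-file  read the password from a file on HDFS instead
#   --driver         JDBC driver class to use (usually detected automatically)
sqoop list-tables \
--connect jdbc:mysql://linux123:3306/sqoop?useSSL=false \
--username hive \
--password 12345678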

 

Common parameters - import


 

Common parameters - export

 

Common parameters - hive


 

import parameters

 

export parameters

 
