3.2.5 Data Migration Tool - Sqoop

Part 1: Sqoop Overview

Sqoop is an open-source tool mainly used to transfer data between Hadoop (Hive) and relational databases (MySQL, PostgreSQL, etc.). You can import data from a relational database (MySQL, Oracle, PostgreSQL, etc.) into HDFS, and you can also export data from HDFS back into a relational database.
The Sqoop project started in 2009. It began as a third-party module for Hadoop; later, to let users deploy it quickly and developers iterate faster, Sqoop became an independent Apache project.

Part 2: Installation and Configuration

Sqoop official website: http://sqoop.apache.org/
Sqoop download address: http://www.apache.org/dyn/closer.lua/sqoop/
1. Download, upload, and decompress
Upload the downloaded installation package sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the virtual machine;
decompress the package:

tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz 
mv sqoop-1.4.7.bin__hadoop-2.6.0/ ../servers/sqoop-1.4.7/

2. Add environment variables and make them take effect

vi /etc/profile 
# Add the following lines 
export SQOOP_HOME=/opt/lagou/servers/sqoop-1.4.7 
export PATH=$PATH:$SQOOP_HOME/bin 
source /etc/profile

3. Create and modify configuration files

# Configuration files live under $SQOOP_HOME/conf; the file to modify is sqoop-env.sh
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh

# Append the following lines to the end of the file
export HADOOP_COMMON_HOME=/opt/lagou/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/lagou/servers/hadoop-2.9.2
export HIVE_HOME=/opt/lagou/servers/hive-2.3.7

4. Copy the JDBC driver

# Copy the JDBC driver into Sqoop's lib directory (note: a symbolic link also works)
ln -s /opt/lagou/servers/hive-2.3.7/lib/mysql-connector-java-5.1.46.jar /opt/lagou/servers/sqoop-1.4.7/lib/

5. Copy JAR files
Copy hive-common-2.3.7.jar from $HIVE_HOME/lib to the $SQOOP_HOME/lib directory. If you do not copy it, importing data from MySQL to Hive fails with: ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

# Either a hard copy or a symbolic link works; run one of the two. Below is the hard copy:
cp $HIVE_HOME/lib/hive-common-2.3.7.jar $SQOOP_HOME/lib/

# Or create a symbolic link:
ln -s /opt/lagou/servers/hive-2.3.7/lib/hive-common-2.3.7.jar /opt/lagou/servers/sqoop-1.4.7/lib/hive-common-2.3.7.jar

Also copy $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar to the $SQOOP_HOME/lib/ directory; otherwise, creating a Sqoop job reports: java.lang.NoClassDefFoundError: org/json/JSONObject

cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib/

6. Installation verification

[root@linux123 ~]# sqoop version

...

(warnings omitted)

...
20/06/19 10:37:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

 

# Test whether Sqoop can connect to the database successfully
[root@linux123 ~]# sqoop list-databases --connect jdbc:mysql://linux123:3306/?useSSL=false --username hive --password 12345678

 

Warning: ...

(warnings omitted)

...
information_schema
hivemetadata
mysql
performance_schema
sys

Part 3: Application Cases

In Sqoop:

  • Import: transferring data from a relational database into the big data cluster (HDFS, Hive, HBase);
    uses the import keyword.
  • Export: transferring data from the big data cluster into a relational database;
    uses the export keyword.

Test data script

-- Used to generate test data in MySQL
CREATE DATABASE sqoop;
 
use sqoop;
 
CREATE TABLE sqoop.goodtbl(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);
 
 
DROP FUNCTION IF EXISTS `rand_string`;
DROP PROCEDURE IF EXISTS `batchInsertTestData`;
 
 
-- Change the default statement delimiter from ; to //
DELIMITER //
 
CREATE FUNCTION `rand_string` (n INT) RETURNS VARCHAR(255)
CHARSET 'utf8'
BEGIN
   DECLARE char_str VARCHAR(200) DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
   DECLARE return_str VARCHAR(255) DEFAULT '';
   DECLARE i INT DEFAULT 0;
   WHILE i < n DO
        SET return_str = CONCAT(return_str, SUBSTRING(char_str, FLOOR(1 + RAND()*36), 1));
        SET i = i+1;
   END WHILE;
   RETURN return_str;
END
//
 
 
-- First parameter: starting serial number; second parameter: number of records to insert
CREATE PROCEDURE `batchInsertTestData` (m INT, n INT)
BEGIN
   DECLARE i INT DEFAULT 0;
   WHILE i < n DO
      INSERT INTO goodtbl (gname, serialNumber, price, stock_number, create_time)
      VALUES (rand_string(6), i+m, ROUND(RAND()*100), FLOOR(RAND()*100), NOW());
      SET i = i+1;
   END WHILE;
END
//
 
DELIMITER ;
 
call batchInsertTestData(1, 100);
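
After the procedure finishes, a quick sanity check from the shell (using the same MySQL account as the Sqoop examples below) should report 100 rows:

mysql -h linux123 -uhive -p12345678 -e "select count(*) from sqoop.goodtbl;"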

Section 1: Importing Data (import)

MySQL to HDFS
1. Import all data

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
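
After the job finishes, you can verify the result directly on HDFS (a quick check; with a single mapper the output lands in one part file):

# list the output directory and peek at the first few rows
hdfs dfs -ls /root/lagou
hdfs dfs -cat /root/lagou/part-m-00000 | head -n 5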

2. Import query data

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query 'select gname, serialNumber, price, stock_number,
create_time from goodtbl where price>88 and $CONDITIONS order by price;'
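
One caveat worth adding: the query above is wrapped in single quotes, so the shell leaves $CONDITIONS alone. If you wrap the query in double quotes, escape the placeholder as \$CONDITIONS, and if you want more than one mapper with a free-form query you must also supply --split-by. A sketch with the same connection details (order by dropped for brevity):

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 2 \
--split-by serialNumber \
--fields-terminated-by "\t" \
--query "select gname, serialNumber, price, stock_number, create_time from goodtbl where price>88 and \$CONDITIONS"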

3. Import specified columns

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns gname,serialNumber,price \
--table goodtbl

Note: when --columns lists multiple columns, separate them with commas and do not add spaces between them.
4. Import filtered data (using the where keyword)

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--table goodtbl \
--where "price>=68"

5. Import data with multiple map tasks
First add more data to goodtbl; the procedure takes a starting serial number and a row count, for example: call batchInsertTestData(101, 1000000);

# Add a primary key to the goodtbl table
alter table goodtbl add primary key(serialNumber);
 
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--append \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by serialNumber
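
Because --num-mappers is not set here, Sqoop falls back to its default of 4 map tasks and splits the serialNumber range among them; each task writes its own part file, which you can confirm on HDFS (a quick check of the target directory used above):

# expect one part-m-0000N file per map task (4 by default)
hdfs dfs -ls /root/lagou/sqoop/5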


MySQL to Hive

Create a table in hive:

CREATE TABLE mydb.goodtbl(
gname string,
serialNumber int,
price int,
stock_number int,
create_time date);

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table mydb.goodtbl \
-m 1
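
To confirm the rows arrived in Hive, a simple count works (assuming the hive CLI is available on this node; beeline works the same way):

hive -e "select count(*) from mydb.goodtbl;"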


Section 2: Exporting Data (export)

  • Data entering the big data platform: import
  • Data leaving the big data platform: export
    Note: the target MySQL table must be created in advance.
-- Create the target table in advance
CREATE TABLE sqoop.goodtbl2(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);

sqoop export \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl2 \
-m 4 \
--export-dir /user/hive/warehouse/mydb.db/goodtbl \
--input-fields-terminated-by "\t" 
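
To verify the export, count the rows on the MySQL side (a quick check, assuming the mysql client is installed on this node):

mysql -h linux123 -uhive -p12345678 -e "select count(*) from sqoop.goodtbl2;"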

Section 3: Incremental Data Import

There are two ways to incrementally import data:

  • Based on an incrementing column (Append mode)
  • Based on a time column (LastModified mode)

Append mode
1. Prepare initial data
-- Delete all data from the MySQL table
truncate table sqoop.goodtbl;
 
-- Delete all data from the Hive table
truncate table mydb.goodtbl;
 
-- Insert 100 rows into the MySQL table
call batchInsertTestData(1, 100);

2. Import data into Hive

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1

3. Check whether the Hive table now contains data, and how many rows.
4. Add 1,000 more rows to MySQL, with serial numbers starting from 200:

call batchInsertTestData(200, 1000);

5. Run the incremental import again to bring the new MySQL data into Hive; this time set last-value to 100:

sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1

6. Check again whether the Hive table has the data and how many rows it now contains.
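
A quick way to check (assuming the hive CLI is on the PATH): if both incremental imports ran cleanly, the count should be around 1,100 rows, i.e. 100 from the first batch plus 1,000 from the second.

hive -e "select count(*), max(serialNumber) from mydb.goodtbl;"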

Section 4: Executing a Job

There are two ways to run the incremental import:

  1. Manually update last-value each time and trigger the import by hand
  2. Use a Sqoop job: give it an initial last-value and schedule it as a daily timed task (see the crontab sketch under step 3 below)

Obviously, method 2 is easier.

1. Create a password file

echo -n "12345678" > sqoopPWD.pwd
hdfs dfs -mkdir -p /sqoop/pwd
hdfs dfs -put sqoopPWD.pwd /sqoop/pwd
hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd
 
# The sqoop job can then reference it with:
--password-file /sqoop/pwd/sqoopPWD.pwd

2. Create sqoop job

# Create the sqoop job
sqoop job --create myjob1 -- import \
--connect jdbc:mysql://linux123:3306/sqoop?useSSL=false \
--username hive \
--password-file /sqoop/pwd/sqoopPWD.pwd \
--table goodtbl \
--incremental append \
--hive-import \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1
 
# List the jobs that have been created
sqoop job --list
 
# Show a job's detailed runtime parameters
sqoop job --show myjob1
 
# Execute the job
sqoop job --exec myjob1
 
# Delete a job
sqoop job --delete myjob1

3. Execute job

sqoop job --exec myjob1
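
For the daily scheduling mentioned in method 2, a common approach is a cron entry on the node where Sqoop is installed (a sketch, assuming the paths used above and that the required Hadoop/Hive environment variables are visible to cron):

# crontab -e : run the incremental import job every day at 01:30
30 1 * * * /opt/lagou/servers/sqoop-1.4.7/bin/sqoop job --exec myjob1 >> /tmp/sqoop_myjob1.log 2>&1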

4. View the data

How it works:
After the job runs, the maximum value of the check-column seen in that run is recorded in the job metadata, and it is used as the last-value the next time the job is executed.
By default the metadata is stored under ~/.sqoop/

There, the metastore.db.script file records each update to the last-value:

cat metastore.db.script |grep incremental.last.value

Part 4: Commonly Used Commands and Parameters

Section 1: Common Commands

(The table of common Sqoop commands appeared only as an image in the original post and is omitted here.)
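
Since the command table only exists as an image, the quickest substitute is to ask Sqoop itself to list the available tools:

sqoop help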

Section 2: Common Parameters

The so-called common parameters are the ones supported by most commands.
The parameter tables appeared as images in the original post; the categories covered were:

  • Common parameters - database connection
  • Common parameters - import
  • Common parameters - export
  • Common parameters - hive
  • import control parameters
  • export control parameters
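
Likewise, the full list of parameters for any tool can be printed locally instead of relying on the image tables:

sqoop help import
sqoop help export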
