Data Migration Tool – Sqoop
Part 1: Sqoop Overview
Sqoop is an open source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (mysql, postgresql, etc.). You can import data from relational databases (MySQL, Oracle, Postgres, etc.) into HDFS, and you can also import data from HDFS into relational databases.
The Sqoop project started in 2009. It first existed as a third-party module of Hadoop. Later, in order to allow users to quickly deploy and to allow developers to iteratively develop faster, Sqoop independently became an Apache project.
Part 2: Installation and Configuration
Sqoop official website: http://sqoop.apache.org/
Sqoop download address: http://www.apache.org/dyn/closer.lua/sqoop/
1. Download, upload and decompress
Upload the downloaded package sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the virtual machine, then decompress it:
tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ ../servers/sqoop-1.4.7/
2. Increase environment variables and make them effective
vi /etc/profile
# Add the following lines
export SQOOP_HOME=/opt/lagou/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
3. Create and modify configuration files
# Configuration files live in $SQOOP_HOME/conf; the file to modify is sqoop-env.sh
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
# Append the following lines to the end of the file
export HADOOP_COMMON_HOME=/opt/lagou/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/lagou/servers/hadoop-2.9.2
export HIVE_HOME=/opt/lagou/servers/hive-2.3.7
4. Copy the JDBC driver
# Copy the JDBC driver into Sqoop's lib directory (note: a symbolic link also works)
ln -s /opt/lagou/servers/hive-2.3.7/lib/mysql-connector-java-5.1.46.jar /opt/lagou/servers/sqoop-1.4.7/lib/
5. Copy jars
Copy hive-common-2.3.7.jar from $HIVE_HOME/lib to the $SQOOP_HOME/lib directory. Without it, copying data from MySQL to Hive fails with: ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
# Either a hard copy or a symbolic link works; run one of the two. Below is the hard copy
cp $HIVE_HOME/lib/hive-common-2.3.7.jar $SQOOP_HOME/lib/
# Create a symbolic link
ln -s /opt/lagou/servers/hive-2.3.7/lib/hive-common-2.3.7.jar /opt/lagou/servers/sqoop-1.4.7/lib/hive-common-2.3.7.jar
Copy $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar to the $SQOOP_HOME/lib/ directory; otherwise, creating a sqoop job fails with: java.lang.NoClassDefFoundError: org/json/JSONObject
cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib/
6. Installation verification
[root@linux123 ~]# sqoop version
...
(warnings omitted)
...
20/06/19 10:37:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
# Test whether Sqoop can connect to the database
[root@linux123 ~]# sqoop list-databases --connect jdbc:mysql://linux123:3306/?useSSL=false --username hive --password 12345678
Warning: ...
(warnings omitted)
...
information_schema
hivemetadata
mysql
performance_schema
sys
Part 3: Application Cases
In Sqoop:
- Import means transferring data from a relational database into the big data cluster (HDFS, Hive, HBase); use the import keyword.
- Export means transferring data from the big data cluster into a relational database; use the export keyword.
Test data script
-- Generate test data in MySQL
CREATE DATABASE sqoop;
use sqoop;
CREATE TABLE sqoop.goodtbl(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);
DROP FUNCTION IF EXISTS `rand_string`;
DROP PROCEDURE IF EXISTS `batchInsertTestData`;
-- Change the default statement delimiter from ; to //
DELIMITER //
CREATE FUNCTION `rand_string` (n INT) RETURNS VARCHAR(255)
CHARSET 'utf8'
BEGIN
DECLARE char_str VARCHAR(200) DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
DECLARE return_str VARCHAR(255) DEFAULT '';
DECLARE i INT DEFAULT 0;
WHILE i < n DO
SET return_str = CONCAT(return_str, SUBSTRING(char_str, FLOOR(1 + RAND()*36), 1));
SET i = i+1;
END WHILE;
RETURN return_str;
END
//
-- First parameter: the starting serial number; second parameter: the number of rows to insert
CREATE PROCEDURE `batchInsertTestData` (m INT, n INT)
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < n DO
INSERT INTO goodtbl (gname, serialNumber, price, stock_number, create_time)
VALUES (rand_string(6), i+m, ROUND(RAND()*100), FLOOR(RAND()*100), NOW());
SET i = i+1;
END WHILE;
END
//
DELIMITER ;
call batchInsertTestData(1, 100);
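After the procedure runs, a quick sanity check in the same MySQL session confirms the rows landed (the expected values follow from the call above, which inserts 100 rows numbered from 1):

```sql
-- Expect 100 rows with serialNumber running from 1 to 100
SELECT COUNT(*) FROM sqoop.goodtbl;
SELECT MIN(serialNumber), MAX(serialNumber) FROM sqoop.goodtbl;
```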
Section 1 Import Data (import)
MySQL to HDFS
1. Import all data
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
2. Import query data
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query 'select gname, serialNumber, price, stock_number,
create_time from goodtbl where price>88 and $CONDITIONS order by price;'
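Sqoop requires the literal token `$CONDITIONS` in a free-form `--query` so that each mapper can append its own split predicate at runtime. Whether it survives the shell is purely a quoting question, which can be checked locally without a cluster. A minimal sketch:

```shell
# With single quotes, the shell passes $CONDITIONS through literally;
# with double quotes, it must be escaped as \$CONDITIONS.
q1='select * from goodtbl where price>88 and $CONDITIONS'
q2="select * from goodtbl where price>88 and \$CONDITIONS"
echo "$q1"
echo "$q2"
# Both print the same string, with $CONDITIONS intact.
```

If the token is expanded (or omitted), Sqoop rejects the query, so prefer the single-quoted form.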
3. Import the specified column
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns gname,serialNumber,price \
--table goodtbl
Note: when --columns lists multiple columns, separate them with commas and do not add spaces.
4. Import query data (use keywords)
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--table goodtbl \
--where "price>=68"
5. Import data with multiple map tasks
Add more data to goodtbl (batchInsertTestData takes a starting serial number and a row count; 200 is used here so the new serial numbers do not clash with the existing rows):
call batchInsertTestData(200, 1000000);
# 给 goodtbl 表增加主键
alter table goodtbl add primary key(serialNumber);
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--append \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by serialNumber
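With `--split-by`, Sqoop first runs `SELECT MIN(serialNumber), MAX(serialNumber) FROM goodtbl` and then carves that range roughly evenly among the mappers, each of which imports its own slice. The arithmetic below is an illustration only (the min/max values are made up, and Sqoop's real integer splitter rounds boundaries slightly differently):

```shell
# Illustrative only: how a [min, max] range becomes per-mapper
# BETWEEN predicates. Real Sqoop computes min/max from the table.
min=1; max=1000100; mappers=4
size=$(( (max - min + 1) / mappers ))
i=0
while [ $i -lt $mappers ]; do
  lo=$(( min + i * size ))
  if [ $i -eq $(( mappers - 1 )) ]; then hi=$max; else hi=$(( lo + size - 1 )); fi
  echo "mapper $i: WHERE serialNumber BETWEEN $lo AND $hi"
  i=$(( i + 1 ))
done
```

This is why the split column should be evenly distributed (a primary key is ideal): a skewed column gives some mappers far more rows than others.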
MySQL to Hive
Create a table in hive:
CREATE TABLE mydb.goodtbl(
gname string,
serialNumber int,
price int,
stock_number int,
create_time date);
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table mydb.goodtbl \
-m 1
Section 2 Export Data
- Data entering the big data platform: import
- Data leaving the big data platform: export
Note: MySQL tables need to be created in advance
-- Create the table in advance
CREATE TABLE sqoop.goodtbl2(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);
sqoop export \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl2 \
-m 4 \
--export-dir /user/hive/warehouse/mydb.db/goodtbl \
--input-fields-terminated-by "\t"
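Once the export finishes, the row count in MySQL should match the source Hive table; a quick check in MySQL:

```sql
-- Expect the same number of rows as mydb.goodtbl in Hive
SELECT COUNT(*) FROM sqoop.goodtbl2;
```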
Section 3 Incremental Data Import
There are two ways to incrementally import data:
- Incremental data import based on incremental columns (Append method)
- Incremental import of data based on time column (LastModified method)
Append method
1. Prepare initial data
-- Delete all data from the MySQL table
truncate table sqoop.goodtbl;
-- Delete all data from the Hive table
truncate table mydb.goodtbl;
-- Insert 100 rows into the MySQL table
call batchInsertTestData(1, 100);
2. Import data into Hive
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1
3. Check that the Hive table contains data, and how many rows there are.
4. Add 1000 rows to MySQL, with serial numbers starting from 200
call batchInsertTestData(200, 1000);
5. Run the incremental import again to load the new MySQL rows into Hive; this time set last-value to 100
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1
6. Check again how many rows the Hive table now contains
Section 4 Execute job
There are two ways to implement incremental data import:
- Configure the last-value by hand each time and trigger the import manually
- Create a job with an initial last-value and let a scheduler run it regularly (e.g. daily)
Clearly the second way is easier.
1. Create a password file
echo -n "12345678" > sqoopPWD.pwd
hdfs dfs -mkdir -p /sqoop/pwd
hdfs dfs -put sqoopPWD.pwd /sqoop/pwd
hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd
# The sqoop job can then include:
--password-file /sqoop/pwd/sqoopPWD.pwd
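The `-n` flag on `echo` matters here: without it, a trailing newline becomes part of the stored password and authentication fails. A quick local check of the file's byte count (using /tmp instead of HDFS):

```shell
# Write the password without a trailing newline and confirm the byte count
# equals the password length exactly.
echo -n "12345678" > /tmp/sqoopPWD.pwd
bytes=$(wc -c < /tmp/sqoopPWD.pwd)
echo "$bytes"
```

A plain `echo` would produce 9 bytes instead of 8.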
2. Create sqoop job
# Create the sqoop job
sqoop job --create myjob1 -- import \
--connect jdbc:mysql://linux123:3306/sqoop?useSSL=false \
--username hive \
--password-file /sqoop/pwd/sqoopPWD.pwd \
--table goodtbl \
--incremental append \
--hive-import \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1
# List the created jobs
sqoop job --list
# Show the job's detailed runtime parameters
sqoop job --show myjob1
# Execute the job
sqoop job --exec myjob1
# Delete the job
sqoop job --delete myjob1
3. Execute job
sqoop job --exec myjob1
4. View the data
Implementation principle:
After the job runs, the maximum value seen in the check-column is recorded in the job metastore, and that value is used as the last-value the next time the job is invoked.
By default the metadata is saved under ~/.sqoop/
There, the metastore.db.script file records the updates to last-value:
cat metastore.db.script |grep incremental.last.value
Part 4: Commonly Used Commands and Parameters
Section 1 Common Commands
Section 2 Common Parameters
Common parameters are those supported by most commands.
Common parameters: database connection
Common parameters: import
Common parameters: export
Common parameters: hive
Import parameters
Export parameters