Table of Contents
Part 1: Sqoop Overview
Part 2: Installation and Configuration
Section 3: Incremental Data Import
Part 4: Common Commands and Parameters
Common parameters: database connection
Data Migration Tool - Sqoop
Part 1: Sqoop Overview
Sqoop is an open-source tool used mainly to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and it can also export data from HDFS into a relational database.
The Sqoop project started in 2009. It first existed as a third-party module of Hadoop; later, to let users deploy it quickly and developers iterate faster, Sqoop became an independent Apache project.
Part 2: Installation and Configuration
Sqoop official website: http://sqoop.apache.org/
Sqoop download address: http://www.apache.org/dyn/closer.lua/sqoop/
1. Download, upload and unzip
Upload the downloaded installation package sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the virtual machine, then unzip it:
tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ ../servers/sqoop-1.4.7/
2. Add environment variables and make them take effect
vi /etc/profile
# Add the following content
export SQOOP_HOME=/opt/lagou/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile
3. Create and modify configuration files
# Configuration file location $SQOOP_HOME/conf; the configuration file to be modified is sqoop-env.sh
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
# Add the following content at the end of the file
export HADOOP_COMMON_HOME=/opt/lagou/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/lagou/servers/hadoop-2.9.2
export HIVE_HOME=/opt/lagou/servers/hive-2.3.7
4. Copy the JDBC driver
# Copy the jdbc driver to the lib directory of sqoop (note: soft link can also be established)
ln -s /opt/lagou/servers/hive-2.3.7/lib/mysql-connector-java-5.1.46.jar /opt/lagou/servers/sqoop-1.4.7/lib/
5. Copy the hive-common jar
Copy hive-common-2.3.7.jar from $HIVE_HOME/lib to the $SQOOP_HOME/lib directory. If you do not, copying data from MySQL to Hive fails with: ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
# Either a hard copy or a soft link works; choose one. Hard copy:
cp $HIVE_HOME/lib/hive-common-2.3.7.jar $SQOOP_HOME/lib/
# Or establish a soft link:
ln -s /opt/lagou/servers/hive-2.3.7/lib/hive-common-2.3.7.jar /opt/lagou/servers/sqoop-1.4.7/lib/hive-common-2.3.7.jar
Copy $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar to the $SQOOP_HOME/lib/ directory; otherwise, when creating a sqoop job, it will report: java.lang.NoClassDefFoundError: org/json/JSONObject
cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib/
6. Installation verification
[root@linux123 ~]# sqoop version
...
(warnings omitted)
...
20/06/19 10:37:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
# Test whether Sqoop can successfully connect to the database
[root@linux123 ~]# sqoop list-databases --connect jdbc:mysql://linux123:3306/?useSSL=false --username hive --password 12345678
Warning: ...
(warnings omitted)
...
information_schema
hivemetadata
mysql
performance_schema
sys
Part 3: Application Cases
In Sqoop:
- Import means transferring data from a relational database into the big data cluster (HDFS, Hive, HBase); it uses the import keyword.
- Export means transferring data from the big data cluster into a relational database; it uses the export keyword.
Test data script
-- Generate test data in MySQL
CREATE DATABASE sqoop;
use sqoop;
CREATE TABLE sqoop.goodtbl(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);
DROP FUNCTION IF EXISTS `rand_string`;
DROP PROCEDURE IF EXISTS `batchInsertTestData`;
-- Change the statement delimiter from the default ; to //
DELIMITER //
CREATE FUNCTION `rand_string` (n INT) RETURNS VARCHAR(255)
CHARSET 'utf8'
BEGIN
DECLARE char_str VARCHAR(200) DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
DECLARE return_str VARCHAR(255) DEFAULT '';
DECLARE i INT DEFAULT 0;
WHILE i < n DO
SET return_str = CONCAT(return_str, SUBSTRING(char_str, FLOOR(1 + RAND()*36), 1));
SET i = i+1;
END WHILE;
RETURN return_str;
END
//
-- First parameter: the starting serial number; second parameter: how many rows to insert
CREATE PROCEDURE `batchInsertTestData` (m INT, n INT)
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < n DO
INSERT INTO goodtbl (gname, serialNumber, price, stock_number, create_time)
VALUES (rand_string(6), i+m, ROUND(RAND()*100), FLOOR(RAND()*100), NOW());
SET i = i+1;
END WHILE;
END
//
DELIMITER ;
call batchInsertTestData(1, 100);
Section 1: Importing data (import)
MySQL to HDFS
1. Import all data
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
Remarks:
- target-dir: the HDFS path to import the data into;
- delete-target-dir: if the target directory already exists on HDFS, running the job again reports an error; --delete-target-dir deletes the directory first. Alternatively, the --append parameter appends the data;
- num-mappers: how many Map Tasks to start; 4 Map Tasks are started by default; it can also be written as -m 1;
- fields-terminated-by: the field separator used in the HDFS files;
2. Import query data
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query 'select gname, serialNumber, price, stock_number,
create_time from goodtbl where price>88 and $CONDITIONS order by price;'
Remarks:
- The where clause of the query statement must contain '$CONDITIONS';
- If the query is wrapped in double quotes, an escape character must be added before $CONDITIONS (i.e. \$CONDITIONS) to keep the shell from expanding it as a shell variable.
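The escaping rule can be checked with the shell alone, without running Sqoop at all:

```shell
# Not a Sqoop call: a plain shell demonstration of the quoting rule.
Q1='where price > 88 and $CONDITIONS'    # single quotes: $CONDITIONS survives as-is
Q2="where price > 88 and \$CONDITIONS"   # double quotes: the $ must be escaped
echo "$Q1"
echo "$Q2"
```

Both lines print the literal text `where price > 88 and $CONDITIONS`, which is exactly what Sqoop needs to receive.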
3. Import the specified column
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns gname,serialNumber,price \
--table goodtbl
Remarks: if multiple columns are given in --columns, separate them with commas and do not add spaces.
4. Import query data (use keywords)
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--table goodtbl \
--where "price>=68"
5. Start multiple Map Tasks to import the data
Add more data to goodtbl (the procedure's first argument is the starting number, the second the row count):
call batchInsertTestData(101, 1000000);
# Add a primary key to goodtbl
alter table goodtbl add primary key(serialNumber);
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--append \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by serialNumber
sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--target-dir /root/lagou/sqoop/5 \
--delete-target-dir \
--fields-terminated-by "\t" \
--table goodtbl \
--split-by gname
Note:
When using multiple Map Tasks to import data, Sqoop must partition the data among the tasks:
- If the MySQL table has a primary key, just specify the number of Map Tasks;
- If the MySQL table has no primary key, use --split-by to specify the partition field;
- If the partition field is a character type, add -Dorg.apache.sqoop.splitter.allow_text_splitter=true to the sqoop command, i.e.:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux123:3306/sqoop \
... ...
- The '$CONDITIONS' in the where clause of the query statement is also used for data partitioning, even if there is only one Map Task.
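The partitioning idea can be sketched in plain shell (a simplified illustration, not Sqoop's actual code): Sqoop queries the MIN and MAX of the split column and divides that range evenly among the Map Tasks, giving each task its own range predicate.

```shell
# Simplified sketch of how a numeric split column is divided among 4 mappers.
# Assumes serialNumber ranges from 1 to 100, as in the test data above.
MIN=1; MAX=100; MAPPERS=4
STEP=$(( (MAX - MIN + 1) / MAPPERS ))
for i in $(seq 0 $(( MAPPERS - 1 ))); do
  LO=$(( MIN + i * STEP ))
  HI=$(( LO + STEP - 1 ))
  [ "$i" -eq $(( MAPPERS - 1 )) ] && HI=$MAX   # last mapper takes any remainder
  echo "mapper $i: serialNumber BETWEEN $LO AND $HI"
done
```

Each of the four mappers then imports only its own slice, which is why a primary key or --split-by column is mandatory once -m is greater than 1.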
MySQL to Hive
Create a table in Hive:
CREATE TABLE mydb.goodtbl(
gname string,
serialNumber int,
price int,
stock_number int,
create_time date);
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table mydb.goodtbl \
-m 1
Parameter description:
- hive-import: required; specifies import into Hive
- hive-database: Hive database name (defaults to default)
- hive-table: Hive table name
- fields-terminated-by: Hive field separator
- hive-overwrite: overwrite existing data
- create-hive-table: creates the Hive table, but the table may be created incorrectly; this parameter is not recommended — create the table in advance instead
Section 2: Exporting data (export)
Data moving into the big data platform is an import (import); data leaving the big data platform is an export (export).
Note: MySQL tables need to be created in advance
-- Create the table in advance
CREATE TABLE sqoop.goodtbl2(
gname varchar(50),
serialNumber int,
price int,
stock_number int,
create_time date);
hive mydb.goodtbl → MySQL sqoop.goodtbl2
sqoop export \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive \
--password 12345678 \
--table goodtbl2 \
-m 4 \
--export-dir /user/hive/warehouse/mydb.db/goodtbl \
--input-fields-terminated-by "\t"
Section 3 Incremental Data Import
Change Data Capture (CDC)
All of the imports above were full imports. If the data volume is small, extracting the complete source data works; if the source data volume is large, only the changed data should be extracted. This extraction mode is called Change Data Capture, or CDC for short.
CDC is roughly divided into two types: intrusive and non-intrusive. Intrusive means the CDC operation affects the performance of the source system: any CDC operation that executes SQL against the source database in any way is considered intrusive.
The four commonly used CDC methods are (the first three are intrusive):
Timestamp-based CDC. The extraction process determines which data is incremental from certain attribute columns. The most common attribute columns are:
- Timestamp: ideally two columns, an insert timestamp recording when the row was created and an update timestamp recording when it was last updated;
- Sequence: most databases provide auto-increment columns; when a table column is defined as auto-increment, newly inserted rows are easy to identify from it;
This method is the simplest and most commonly used, but it has the following disadvantages:
- It cannot record delete operations;
- It cannot recognize multiple updates to the same row;
- It has no real-time capability.
Trigger-based CDC. When SQL statements such as INSERT, UPDATE, and DELETE execute, triggers in the database fire; the triggers capture the changed data and save it in an intermediate temporary table, from which the changes are then retrieved. In most cases, adding triggers to an operational database is not allowed, and this method reduces system performance, so it is rarely adopted.
Snapshot-based CDC. Data changes are obtained by comparing the source table with a snapshot table. Snapshot-based CDC can detect inserted, updated, and deleted data, which is an advantage over the timestamp-based scheme. The disadvantage is that it requires significant storage space for the snapshots.
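As a toy illustration of the snapshot comparison (plain shell, with hypothetical file names): sorted "key,value" dumps of yesterday's snapshot and today's extract can be diffed with comm.

```shell
# Toy snapshot comparison on sorted "key,value" dumps (hypothetical files)
printf '1,a\n2,b\n3,c\n' > snapshot_old.csv   # yesterday's snapshot
printf '1,a\n2,B\n4,d\n' > snapshot_new.csv   # today's extract
# Lines only in the new file: inserted rows, or the new version of updated rows
comm -13 snapshot_old.csv snapshot_new.csv
# Lines only in the old file: deleted rows, or the old version of updated rows
comm -23 snapshot_old.csv snapshot_new.csv
```

Here key 2 changed, key 3 was deleted, and key 4 was inserted; a real implementation would join the two outputs on the key to classify each change.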
Log-based CDC. The most complex but non-intrusive CDC method is the log-based one. The database records every insert, update, and delete operation in its log; parsing the log files yields the change information. Each relational database uses its own log format, so there is no universal product; Alibaba's Canal, for example, can parse MySQL log files.
There are two ways to import data incrementally:
- Incremental data import based on incremental columns (Append method)
- Incremental data import based on time column (LastModified method)
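Only the Append method is walked through below. For reference, a LastModified import might look like the following hedged sketch; it assumes goodtbl had an update-timestamp column (hypothetically named modify_time here, which the test schema above does not include). Note that --incremental lastmodified targets an HDFS directory and is not supported together with --hive-import.

```shell
# Hypothetical sketch of the LastModified method; assumes a modify_time
# timestamp column on goodtbl (not present in the test schema above).
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental lastmodified \
--check-column modify_time \
--last-value "2020-06-19 00:00:00" \
--merge-key serialNumber \
--target-dir /root/lagou/sqoop/lastmodified \
-m 1
```

--merge-key merges re-imported rows with existing ones by key, so an updated row replaces its earlier copy instead of being appended as a duplicate.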
The Append method
1. Prepare initial data
-- Delete all data from the MySQL table
truncate table sqoop.goodtbl;
-- Delete all data from the Hive table
truncate table mydb.goodtbl;
-- Insert 100 rows into the MySQL table
call batchInsertTestData(1, 100);
2. Import data into Hive
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1
Parameter description:
- check-column specifies one or more columns used to determine which rows are new during an incremental import, similar to auto-increment fields and timestamps in relational databases. The specified columns cannot be character types such as char or varchar.
- last-value specifies the maximum value the check column reached in the previous import.
3. Check whether the Hive table has data and how many rows it contains
4. Insert another 1000 rows into MySQL, with numbers starting from 200:
call batchInsertTestData(200, 1000);
5. Perform the incremental import again to bring the new MySQL data into Hive; this time set last-value to 100:
sqoop import \
--connect jdbc:mysql://linux123:3306/sqoop \
--username hive --password 12345678 \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1
6. Check whether the Hive table has data and how many rows it contains
Section 4: Executing a job
There are two ways to implement incremental data import:
- Manually set last-value each time and schedule the runs by hand;
- Use a job: given the initial last-value, a scheduled task runs it every day.
Obviously the second way is easier.
1. Create a password file
echo -n "12345678" > sqoopPWD.pwd
hdfs dfs -mkdir -p /sqoop/pwd
hdfs dfs -put sqoopPWD.pwd /sqoop/pwd
hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd
# A sqoop job can then add:
--password-file /sqoop/pwd/sqoopPWD.pwd
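One pitfall worth checking locally: the file must contain only the password bytes, with no trailing newline (which is exactly what echo -n avoids). A quick sanity check, using a hypothetical /tmp path:

```shell
# The password file must hold only the password bytes: no trailing newline
printf '%s' "12345678" > /tmp/sqoopPWD.pwd   # printf '%s' behaves like echo -n
wc -c < /tmp/sqoopPWD.pwd                    # 8 bytes
```

A plain `echo` would write 9 bytes, and the stray newline would become part of the password Sqoop sends to MySQL.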
2. Create sqoop job
# Create a sqoop job
sqoop job --create myjob1 -- import \
--connect jdbc:mysql://linux123:3306/sqoop?useSSL=false \
--username hive \
--password-file /sqoop/pwd/sqoopPWD.pwd \
--table goodtbl \
--incremental append \
--hive-import \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 0 \
-m 1
# List the created jobs
sqoop job --list
# Show a job's detailed runtime parameters
sqoop job --show myjob1
# Execute the job
sqoop job --exec myjob1
# Delete the job
sqoop job --delete myjob1
3. Execute job
sqoop job --exec myjob1
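To get the daily scheduling mentioned above, the job execution is typically driven by cron (a sketch only; the absolute sqoop path and log location are assumptions to adjust for your installation):

```shell
# Hypothetical crontab entry: run the incremental job every day at 01:00
# (install it with `crontab -e`; paths are examples)
0 1 * * * /opt/lagou/servers/sqoop-1.4.7/bin/sqoop job --exec myjob1 >> /tmp/sqoop-myjob1.log 2>&1
```

Because the job's metastore keeps the last-value up to date, each nightly run picks up only the rows added since the previous run.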
4. View data
Implementation principle:
After the job executes, the current maximum value of the check-column is recorded in the metastore; the next time the job runs, that value is assigned to last-value.
By default the metadata is saved under ~/.sqoop/
The file metastore.db.script records the update operations on last-value:
cat metastore.db.script |grep incremental.last.value
Part 4: Common Commands and Parameters
Section 1: Common Commands
Section 2: Common Parameters
The so-called common parameters are the parameters supported by most commands.
Common parameters: database connection
Common parameters: import
Common parameters: export
Common parameters: hive
import parameters
export parameters