Sqoop Usage Summary
1. Sqoop basic commands
1. View all commands
# sqoop help
2. View the specific usage of a command
# sqoop help <command-name>
2. Sqoop and MySQL
1. Query all MySQL databases
Usually used to test the connection between Sqoop and MySQL:
sqoop list-databases \
--connect jdbc:mysql://hadoop001:3306/ \
--username root \
--password root
2. Query all data tables in the specified database
sqoop list-tables \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root
3. Sqoop and HDFS
3.1 Import MySQL data to HDFS
1. Import commands
Example: import the help_keyword table from MySQL into the HDFS /sqoop directory; if the directory already exists, delete it first and then import, using three map tasks in parallel.
Note: help_keyword is a dictionary table built into MySQL, and the following examples all use this table.
sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \ # table to import
--delete-target-dir \ # delete the target directory first if it exists
--target-dir /sqoop \ # target directory for the import
--fields-terminated-by '\t' \ # field delimiter for the exported data
-m 3 # number of map tasks to run in parallel
As the log output shows, the input data is split evenly into three parts, each processed by one map task. By default, the data is split on the table's primary key column. If the table has no primary key, there are two options:
- Add the --autoreset-to-one-mapper parameter, which starts only a single map task, i.e. no parallel execution;
- If parallel execution is still desired, use --split-by <column-name> to specify a reference column to split the data on.
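To make the split behavior concrete, here is a minimal Python sketch (not Sqoop's actual code) of how an integer key range can be divided evenly among map tasks, in the spirit of Sqoop's default integer splitter; the key range used below is invented for illustration:

```python
def compute_splits(min_val, max_val, num_mappers):
    """Divide the closed range [min_val, max_val] into num_mappers
    contiguous sub-ranges, one per map task."""
    total = max_val - min_val + 1
    size = total / num_mappers  # may be fractional; boundaries are rounded
    splits = []
    lo = min_val
    for i in range(1, num_mappers + 1):
        hi = min_val + round(i * size)
        if i == num_mappers:
            hi = max_val + 1  # last split must include max_val
        splits.append((lo, hi))  # each mapper reads rows with lo <= key < hi
        lo = hi
    return splits

# e.g. primary keys 0..688 split across 3 mappers
print(compute_splits(0, 688, 3))
```

Each mapper then issues a query restricted to its own sub-range, which is why a numeric, reasonably uniform split column matters for balanced parallelism.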
2. Import verification
# list the import directory
hadoop fs -ls -R /sqoop
# view the imported content
hadoop fs -text /sqoop/part-m-00000
Looking at the HDFS import directory, you can see that the table's data has been split into 3 parts for storage, which is determined by the specified parallelism.
3.2 Export HDFS data to MySQL
sqoop export \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword_from_hdfs \ # exported data is stored in the MySQL table help_keyword_from_hdfs
--export-dir /sqoop \
--input-fields-terminated-by '\t' \
--m 3
The table must be created in advance, and the statement to build the table is as follows:
CREATE TABLE help_keyword_from_hdfs LIKE help_keyword ;
4. Sqoop and Hive
4.1 Import MySQL data to Hive
Sqoop imports data into Hive by first importing it into a temporary directory on HDFS, then loading the data from HDFS into Hive, and finally deleting the temporary directory. The temporary directory can be specified with --target-dir.
1. Import commands
sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \ # table to import
--delete-target-dir \ # delete the temporary directory if it exists
--target-dir /sqoop_hive \ # temporary directory location
--hive-database sqoop_test \ # import into Hive's sqoop_test database, which must be created in advance; defaults to the default database if not specified
--hive-import \ # import into Hive
--hive-overwrite \ # if the Hive table already contains data, overwrite it: the existing data is cleared before the new data is written
-m 3 # parallelism
The sqoop_test database that Hive imports into must be created in advance; if no database is specified, Hive's default database is used.
# list all databases in Hive
hive> SHOW DATABASES;
# create the sqoop_test database
hive> CREATE DATABASE sqoop_test;
2. Import verification
# list all tables in the sqoop_test database
hive> SHOW TABLES IN sqoop_test;
# view the data in the table
hive> SELECT * FROM sqoop_test.help_keyword;
3. Possible problems
If execution fails with java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf, you need to copy hive-exec-**.jar from the lib directory of the Hive installation into Sqoop's lib directory.
[root@hadoop001 lib]# ll hive-exec-*
-rw-r--r--. 1 1106 4001 19632031 Nov 13 21:45 hive-exec-1.1.0-cdh5.15.2.jar
[root@hadoop001 lib]# cp hive-exec-1.1.0-cdh5.15.2.jar ${SQOOP_HOME}/lib
4.2 Export Hive data to MySQL
Since Hive data is stored on HDFS, exporting data from Hive to MySQL is effectively exporting data from HDFS to MySQL.
1. View the storage location of the Hive table in HDFS
# switch to the corresponding database
hive> use sqoop_test;
# view table information
hive> desc formatted help_keyword;
The Location attribute shows the table's storage location on HDFS; you can inspect this directory to see the underlying files.
2. Execute the export command
sqoop export \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword_from_hive \
--export-dir /user/hive/warehouse/sqoop_test.db/help_keyword \
--input-fields-terminated-by '\001' \ # note that Hive's default field delimiter is \001
--m 3
The target table in MySQL must be created in advance:
CREATE TABLE help_keyword_from_hive LIKE help_keyword ;
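As a quick illustration of why --input-fields-terminated-by '\001' is needed: Hive's default text format separates fields with the non-printing control character \x01, not a tab or comma. A minimal Python sketch, with a sample row invented for illustration:

```python
# A row as a Hive text table would store it: fields joined by \x01,
# Hive's default field delimiter (often written as \001 in octal).
hive_row = "42\x01select\x012"  # hypothetical help_keyword-style row

fields = hive_row.split("\x01")
print(fields)
```

If you exported such a file with the tab delimiter instead, each whole row would land in a single MySQL column.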
5. Sqoop and HBase
This section only explains importing data from RDBMS to HBase, because there is no command to export data directly from HBase to RDBMS.
5.1 MySQL import data to HBase
1. Import data
Import the help_keyword table into the HBase table help_keyword_hbase, using the source table's primary key help_keyword_id as the RowKey and placing all columns of the source table under the keywordInfo column family. Only importing all columns into a single column family is supported; assigning individual columns to different column families is not.
sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \ # table to import
--hbase-table help_keyword_hbase \ # HBase table name; the table must be created in advance
--column-family keywordInfo \ # import all columns into the keywordInfo column family
--hbase-row-key help_keyword_id # use the source table's help_keyword_id as the RowKey
The imported HBase table needs to be created in advance:
# list all tables
hbase> list
# create the table
hbase> create 'help_keyword_hbase', 'keywordInfo'
# view table information
hbase> desc 'help_keyword_hbase'
2. Import verification
Use the scan command to view the table data:
hbase> scan 'help_keyword_hbase'
6. Full-database import
Sqoop supports importing an entire database to HDFS/Hive with the import-all-tables command, but note the following two limitations:
- Every table must have a primary key; alternatively, use --autoreset-to-one-mapper, which means only one map task is started;
- You cannot use a non-default split column, nor impose any conditions via a WHERE clause.
Since the second point can be confusing, here is the official wording:
- You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
Import the entire database to HDFS:
sqoop import-all-tables \
--connect jdbc:mysql://hadoop001:3306/<database-name> \
--username root \
--password root \
--warehouse-dir /sqoop_all \ # each table is imported into its own directory; this parameter specifies the parent directory for all of them
--fields-terminated-by '\t' \
-m 3
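To clarify the difference from --target-dir: under --warehouse-dir each table lands in its own subdirectory named after the table. A tiny sketch of the resulting layout (the table names here are hypothetical examples):

```python
# --warehouse-dir is the parent directory; every imported table
# gets a subdirectory named after the table itself.
warehouse_dir = "/sqoop_all"
tables = ["help_keyword", "help_topic"]  # hypothetical table names

paths = [f"{warehouse_dir}/{table}" for table in tables]
print(paths)
```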
Import the entire database to Hive:
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hadoop001:3306/<database-name> \
--username root \
--password root \
--hive-database sqoop_test \ # the target Hive database
--hive-import \
--hive-overwrite \
-m 3
7. Sqoop data filtering
7.1 The query parameter
Sqoop supports defining the import SQL with the --query parameter, which lets you export any result set you want. Example usage:
sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--query 'select * from help_keyword where $CONDITIONS and help_keyword_id < 50' \
--delete-target-dir \
--target-dir /sqoop_hive \
--hive-database sqoop_test \ # target database for the import; defaults to Hive's default database if not specified
--hive-table filter_help_keyword \ # target table for the import
--split-by help_keyword_id \ # column used for splitting
--hive-import \ # import into Hive
--hive-overwrite \
-m 3
When filtering data with query, note the following three points:
- The target table must be specified with --hive-table;
- If a parallelism (-m) greater than 1 is used and --autoreset-to-one-mapper is not specified, you must use --split-by to specify the reference column;
- The SQL WHERE clause must contain $CONDITIONS; this is fixed wording, which Sqoop replaces dynamically.
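A rough sketch of what happens to $CONDITIONS: each map task receives a copy of the query with the token replaced by its own split-range predicate. The exact SQL Sqoop generates may differ; the split ranges below are invented to illustrate the mechanism:

```python
query = "select * from help_keyword where $CONDITIONS and help_keyword_id < 50"

# Hypothetical split ranges on help_keyword_id, one per map task (-m 3)
splits = [(0, 17), (17, 34), (34, 50)]

# Each mapper runs the query with $CONDITIONS swapped for its own range
per_mapper_sql = [
    query.replace("$CONDITIONS",
                  f"help_keyword_id >= {lo} AND help_keyword_id < {hi}")
    for lo, hi in splits
]
for sql in per_mapper_sql:
    print(sql)
```

This is why the token is mandatory even with -m 1: Sqoop always needs a place to inject its per-mapper conditions.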
7.2 Incremental import
sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \
--target-dir /sqoop_hive \
--hive-database sqoop_test \
--incremental append \ # specify the mode
--check-column help_keyword_id \ # reference column for the incremental import
--last-value 300 \ # maximum value of the reference column from the last import
--hive-import \
-m 3
The incremental parameter has the following two options:
- append: requires the reference column's values to be increasing; all rows with values greater than last-value are imported;
- lastmodified: requires the reference column to be of timestamp type; the current timestamp should be written to the reference column when data is inserted, and the column should be updated to the current timestamp when data is updated; all rows with timestamps later than last-value are imported.
From the explanation above, you can see that Sqoop's incremental import involves no special magic: it relies on a maintained reference column to determine which rows are incremental data. Of course, you can also use the query parameter described above to perform incremental export manually, which is more flexible.
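The append-mode logic is simple enough to mimic by hand, which is essentially what the manual --query approach amounts to. A minimal sketch with invented rows:

```python
# Rows in the source table, keyed by an ever-increasing id
rows = [(100, "a"), (250, "b"), (300, "c"), (301, "d"), (450, "e")]

last_value = 300  # maximum id seen by the previous import

# append mode: import only rows whose reference column exceeds last-value
new_rows = [r for r in rows if r[0] > last_value]
print(new_rows)

# the next run would pass the new maximum as its --last-value
next_last_value = max(r[0] for r in new_rows)
print(next_last_value)
```

Tracking and persisting last-value between runs is the user's (or a Sqoop saved job's) responsibility.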
8. Type support
Sqoop supports most database field types by default, but some special types are not supported. When an unsupported type is encountered, the program throws an exception like Hive does not support the SQL type for column xxx, which can be worked around by casting with the following two parameters:
- --map-column-java <mapping>: override the mapping from SQL to Java types;
- --map-column-hive <mapping>: override the mapping from SQL to Hive types.
An example is shown below, forcing the id field to String type and the value field to Integer type:
$ sqoop import ... --map-column-java id=String,value=Integer
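The <mapping> argument is a comma-separated list of column=Type pairs. A small sketch of how such a string breaks down into per-column overrides:

```python
# The value passed to --map-column-java in the command above
mapping_arg = "id=String,value=Integer"

# One column=Type pair per comma-separated entry
overrides = dict(pair.split("=") for pair in mapping_arg.split(","))
print(overrides)  # {'id': 'String', 'value': 'Integer'}
```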