Sqoop usage summary

1. Sqoop basic commands

1. View all commands

# sqoop help


2. View the specific usage of a command

# sqoop help <command-name>
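
For example, to view the specific usage of the import command:

# sqoop help import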

2. Sqoop and MySQL

1. Query all MySQL databases

Usually used to test the connection between Sqoop and MySQL:

sqoop list-databases \
--connect jdbc:mysql://hadoop001:3306/ \
--username root \
--password root


2. Query all data tables in the specified database

sqoop list-tables \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root

3. Sqoop and HDFS

3.1 Import MySQL data to HDFS

1. Import commands

Example: import the help_keyword table from the MySQL mysql database into the /sqoop directory on HDFS. If the directory already exists, delete it first and then import, using three map tasks in parallel.

Note: help_keyword is a dictionary table built into MySQL, and the following examples all use this table.
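
If you want a quick look at the source table before importing, you can query it directly in the MySQL client (a minimal check; the exact contents depend on your MySQL version):

mysql> USE mysql;
mysql> DESC help_keyword;
mysql> SELECT * FROM help_keyword LIMIT 5;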

sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \           # table to import
--delete-target-dir \            # delete the target directory first if it already exists
--target-dir /sqoop \            # target directory for the import
--fields-terminated-by '\t' \    # delimiter for the exported data
-m 3                             # number of map tasks to run in parallel

As the log output shows, the input data is split evenly into three parts, each processed by one map task. By default the data is split on the table's primary key column. If your table has no primary key, there are two options:

  • Add the --autoreset-to-one-mapper parameter, which starts only a single map task, i.e. no parallel execution;
  • If parallel execution is still desired, use --split-by <column-name> to specify a reference column for splitting the data (see the sketch below).
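
A minimal sketch of the second option, splitting on the help_keyword_id column of the sample table (which actually has a primary key, so this is for illustration only):

sqoop import \
--connect jdbc:mysql://hadoop001:3306/mysql \
--username root \
--password root \
--table help_keyword \
--delete-target-dir \
--target-dir /sqoop \
--split-by help_keyword_id \
-m 3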


2. Import verification

# View the directory after the import
hadoop fs -ls -R /sqoop
# View the imported content
hadoop fs -text /sqoop/part-m-00000

Checking the HDFS import directory, you can see that the table data is stored in 3 parts, which is determined by the specified parallelism.


3.2 Export HDFS data to MySQL

sqoop export \
    --connect jdbc:mysql://hadoop001:3306/mysql \
    --username root \
    --password root \
    --table help_keyword_from_hdfs \        # the exported data is stored in the MySQL table help_keyword_from_hdfs
    --export-dir /sqoop \
    --input-fields-terminated-by '\t' \
    -m 3

The table must be created in advance, and the statement to build the table is as follows:

CREATE TABLE help_keyword_from_hdfs LIKE help_keyword ;

4. Sqoop and Hive

4.1 Import MySQL data to Hive

Sqoop imports data into Hive by first importing the data into a temporary directory on HDFS, then loading the data from HDFS into Hive, and finally deleting the temporary directory. You can use --target-dir to specify the temporary directory.

1. Import commands

sqoop import \
  --connect jdbc:mysql://hadoop001:3306/mysql \
  --username root \
  --password root \
  --table help_keyword \        # table to import
  --delete-target-dir \         # delete the temporary directory if it already exists
  --target-dir /sqoop_hive \    # location of the temporary directory
  --hive-database sqoop_test \  # import into the Hive database sqoop_test, which must be created in advance; defaults to the default database if not specified
  --hive-import \               # import into Hive
  --hive-overwrite \            # if the Hive table already contains data, overwrite it: the existing data is cleared before writing
  -m 3                          # parallelism

The sqoop_test database that the data is imported into must be created in Hive in advance; if no database is specified, Hive's default database is used.

# View all databases in Hive
hive> SHOW DATABASES;
# Create the sqoop_test database
hive> CREATE DATABASE sqoop_test;

2. Import verification

# View all tables in the sqoop_test database
hive> SHOW TABLES IN sqoop_test;
# View the data in the table
hive> SELECT * FROM sqoop_test.help_keyword;


3. Possible problems


If execution fails with java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf, you need to copy hive-exec-**.jar from the lib directory of the Hive installation into Sqoop's lib directory.

[root@hadoop001 lib]# ll hive-exec-*
-rw-r--r--. 1 1106 4001 19632031 Nov 13 21:45 hive-exec-1.1.0-cdh5.15.2.jar
[root@hadoop001 lib]# cp hive-exec-1.1.0-cdh5.15.2.jar ${SQOOP_HOME}/lib

4.2 Export data from Hive to MySQL

Since Hive data is stored on HDFS, exporting data from Hive to MySQL is essentially exporting data from HDFS to MySQL.

1. View the storage location of the Hive table in HDFS

# Switch to the corresponding database
hive> use sqoop_test;
# View the table information
hive> desc formatted help_keyword;

The Location attribute shows the table's storage location on HDFS.


You can inspect this directory on HDFS to see the file structure.

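For example, assuming the default warehouse location used in the export command below:

hadoop fs -ls -R /user/hive/warehouse/sqoop_test.db/help_keyword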

2. Execute the export command

sqoop export \
    --connect jdbc:mysql://hadoop001:3306/mysql \
    --username root \
    --password root \
    --table help_keyword_from_hive \
    --export-dir /user/hive/warehouse/sqoop_test.db/help_keyword \
    --input-fields-terminated-by '\001' \    # note that the default delimiter in Hive is \001
    -m 3

The table in MySQL needs to be created in advance:

CREATE TABLE help_keyword_from_hive LIKE help_keyword ;

5. Sqoop and HBase

This section only explains importing data from RDBMS to HBase, because there is no command to export data directly from HBase to RDBMS.

5.1 MySQL import data to HBase

1. Import data

Import the data from the help_keyword table into the HBase table help_keyword_hbase, using the primary key help_keyword_id of the original table as the RowKey and importing all columns of the original table into the keywordInfo column family. Only importing into a single column family is supported; assigning individual columns to different column families is not.

sqoop import \
    --connect jdbc:mysql://hadoop001:3306/mysql \
    --username root \
    --password root \
    --table help_keyword \              # table to import
    --hbase-table help_keyword_hbase \  # HBase table name; the table must be created in advance
    --column-family keywordInfo \       # import all columns into the keywordInfo column family
    --hbase-row-key help_keyword_id     # use help_keyword_id from the original table as the RowKey

The imported HBase table needs to be created in advance:

# View all tables
hbase> list
# Create the table
hbase> create 'help_keyword_hbase', 'keywordInfo'
# View the table information
hbase> desc 'help_keyword_hbase'

2. Import verification

Use scan to view the table data.

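A minimal example, using the LIMIT scan attribute to show only the first few rows:

hbase> scan 'help_keyword_hbase', {LIMIT => 5}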

6. Export the entire database

Sqoop supports exporting an entire database to HDFS / Hive with the import-all-tables command, but note the following two limitations:

  • Every table must have a primary key, or --autoreset-to-one-mapper must be used, which means only one map task is started;
  • You cannot use non-default split columns, nor can you add any restrictions through the WHERE clause.

The explanation of the second point can be confusing; here is the original wording from the official documentation:

  • You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.

Export the entire database to HDFS:

sqoop import-all-tables \
    --connect jdbc:mysql://hadoop001:3306/<database-name> \
    --username root \
    --password root \
    --warehouse-dir /sqoop_all \     # each table is exported to its own directory; this parameter specifies the parent directory of all of them
    --fields-terminated-by '\t' \
    -m 3

Export the entire database to Hive:

sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://hadoop001:3306/<database-name> \
  --username root \
  --password root \
  --hive-database sqoop_test \         # target Hive database
  --hive-import \
  --hive-overwrite \
  -m 3

7. Sqoop data filtering

7.1 The query parameter

Sqoop supports using the --query parameter to define the query SQL, which allows you to export any result set you want. Usage example:

sqoop import \
  --connect jdbc:mysql://hadoop001:3306/mysql \
  --username root \
  --password root \
  --query 'select * from help_keyword where $CONDITIONS and help_keyword_id < 50' \
  --delete-target-dir \
  --target-dir /sqoop_hive \
  --hive-database sqoop_test \           # target Hive database; defaults to the default database if not specified
  --hive-table filter_help_keyword \     # target Hive table
  --split-by help_keyword_id \           # column used for splitting
  --hive-import \                        # import into Hive
  --hive-overwrite \
  -m 3

When using --query to filter data, note the following three points:

  • You must use --hive-table to specify the target table;
  • If the parallelism -m is not 1 and --autoreset-to-one-mapper is not specified, you need to use --split-by to specify the reference column;
  • The WHERE clause of the SQL must contain $CONDITIONS; this is fixed syntax that Sqoop replaces dynamically at run time.

7.2 Incremental import

sqoop import \
    --connect jdbc:mysql://hadoop001:3306/mysql \
    --username root \
    --password root \
    --table help_keyword \
    --target-dir /sqoop_hive \
    --hive-database sqoop_test \
    --incremental append \               # incremental mode
    --check-column help_keyword_id \     # reference column used for the incremental import
    --last-value 300 \                   # the maximum value of the reference column from the previous import
    --hive-import \
    -m 3

The --incremental parameter has the following two options:

  • append: requires the values in the reference column to be incrementing; all rows whose value is greater than last-value are imported;
  • lastmodified: requires the reference column to be of timestamp type; the current timestamp should be written to the reference column when data is inserted, and the column should also be updated whenever the data is updated; all rows whose timestamp is later than last-value are imported.

From the above explanation, you can see that there is nothing magical about Sqoop's incremental import: it relies on a maintained reference column to decide which data is incremental. Of course, you can also use the --query parameter described above to perform incremental export manually, which is more flexible.
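
As a rough sketch of the manual approach (the target directory /sqoop_incr is only an illustrative name, and 300 plays the role of the previous last-value):

sqoop import \
    --connect jdbc:mysql://hadoop001:3306/mysql \
    --username root \
    --password root \
    --query 'select * from help_keyword where $CONDITIONS and help_keyword_id > 300' \
    --target-dir /sqoop_incr \
    --split-by help_keyword_id \
    -m 3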

8. Type support

By default, Sqoop supports most database field types, but some special types are not supported. When an unsupported type is encountered, the program throws an exception like Hive does not support the SQL type for column xxx. In this case, the type can be cast using the following two parameters:

  • --map-column-java <mapping>: override the mapping from SQL to Java types;
  • --map-column-hive <mapping>: override the mapping from SQL to Hive types.

An example is as follows, forcing the original id field to String type and the value field to Integer type:

$ sqoop import ... --map-column-java id=String,value=Integer
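
Similarly, a sketch of overriding the Hive-side mapping (the column name and target type here are only illustrative):

$ sqoop import ... --map-column-hive id=STRING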

Reference

Sqoop User Guide (v1.4.7)
