Sqoop import practice

Sqoop-import

Case 1

The table has no primary key, so the number of map tasks must be set to 1 for the import to run

Sqoop import principle:

By default, Sqoop imports data from the database source in parallel. You can use the -m or --num-mappers argument to specify the number of map tasks (parallel processes) used to perform the import. The argument takes an integer corresponding to the degree of parallelism. Four tasks are used by default; some databases see better performance when this is raised to 8 or 16.
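
For example, a sketch of raising the parallelism to 8 mappers (this only works when the table has a primary key or a --split-by column is given; the connection details follow the examples below):

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp --split-by empno --num-mappers 8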

By default, Sqoop uses the table's primary key column (for example id) as the split column. It retrieves the low and high values of that column from the database and has the map tasks operate on evenly sized slices of the total range. For example, if the id range is 0-800, Sqoop runs 4 tasks by default, finds the bounds with SELECT MIN(id), MAX(id) FROM emp, and then assigns the 4 tasks the id ranges (0-200), (200-400), (400-600), (600-800).

But when a table has no primary key, this split cannot be made and the import fails with an error. In that case the number of mappers can be set to 1 with -m, so only one mapper runs, no split is needed, and the missing-primary-key error is avoided.

# Error message
ERROR tool.ImportTool: Import failed: No primary key could be found for table emp. Please specify one with --split-by or perform a sequential import with '-m 1'.


Import code:

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp -m 1
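
Since no --target-dir is given here, the data lands under the importing user's HDFS home directory in a directory named after the table (for this setup that would be /user/root/emp, an assumption). A quick check:

[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -cat /user/root/emp/part-m-*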

DBMS-HDFS

Case 2

The table has no primary key; use --split-by to specify the column to split on

The situation is the same as above: if the table has no primary key, another option is to specify the split column manually with --split-by.

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--split-by empno \
--delete-target-dir \
--target-dir hdfs://qianfeng01:8020/sqoopdata/emp
# Error
Caused by: java.sql.SQLException: null,  message from server: "Host 'qianfeng01' is not allowed to connect to this MySQL server"

Solution:

First connect to MySQL:

[root@qianfeng01 sqoop-1.4.7]# mysql -uroot -p

(Execute the following statement. *.* means all tables in all databases; % means connections are allowed from any IP address or host.)

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'mysql' WITH GRANT OPTION;
    FLUSH PRIVILEGES;
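
After granting the privileges, the remote connection can be sanity-checked from the Sqoop host (a quick check; the host name follows this setup):

[root@qianfeng01 sqoop-1.4.7]# mysql -h qianfeng01 -uroot -p -e "SELECT 1"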

Case 3: Conditional import (incremental import)

Not all of the data needs to be imported, only the rows that match a condition

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--split-by empno \
--where 'empno > 7777' \
--target-dir hdfs://qianfeng01:8020/sqoopdata/emp

Case 4: Partial field import

Only some of the columns are needed in the import, not all of them

Note: --query is similar to --where but more flexible in use

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--split-by empno \
--query 'select empno,ename,job from emp where empno > 7777 and $CONDITIONS' \
--target-dir hdfs://qianfeng01:8020/sqoopdata/7

DBMS-Hive

Case 5: Import data into Hive

[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hive-import \
-m 1
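
Once the job finishes, the table should be visible in Hive's default database. A quick check (a sketch, assuming the hive CLI is available on this host):

[root@qianfeng01 sqoop-1.4.7]# hive -e "SELECT * FROM emp LIMIT 5;"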

DBMS-HBase

Import data into HBase

Create the table in HBase first:
create 'mysql2hbase','info'

# Method 1:
[root@qianfeng01 sqoop-1.4.7]# sqoop import  --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hbase-table mysql2hbase \
--column-family info \
--hbase-create-table \
--hbase-row-key empno \
-m 1


Note: with HBase 2.X or later, the HBase 1.6 dependencies must be added, otherwise the following error occurs

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.HBaseAdmin.<init>(Lorg/apache/hadoop/conf/Configuration;)V

Download: https://archive.apache.org/dist/hbase/1.6.0/
How to apply: copy all of the jars from the HBase 1.6 lib directory into Sqoop's lib directory, then run the command above again.
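
For example, assuming HBase 1.6.0 was unpacked under /usr/local and Sqoop lives in /usr/local/sqoop-1.4.7 (both paths are assumptions for this setup):

[root@qianfeng01 sqoop-1.4.7]# cp /usr/local/hbase-1.6.0/lib/*.jar /usr/local/sqoop-1.4.7/lib/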

Test:
hbase(main):008:0> scan 'mysql2hbase'
ROW                                      COLUMN+CELL
 1                                       column=info:hobby, timestamp=1585852383291, value=1
 1                                       column=info:profile, timestamp=1585852383291, value=\xE6\xBC\x94\xE5\x91\x98
 1                                       column=info:uname, timestamp=1585852383291, value=bingbing
 2                                       column=info:hobby, timestamp=1585852383291, value=2
 2                                       column=info:profile, timestamp=1585852383291, value=\xE6\xBC\x94\xE5\x91\x98
 2                                       column=info:uname, timestamp=1585852383291, value=feifei
 3                                       column=info:hobby, timestamp=1585852383291, value=1
 3                                       column=info:profile, timestamp=1585852383291, value=\xE5\x94\xB1\xE6\xAD\x8C
 3                                       column=info:uname, timestamp=1585852383291, value=\xE5\x8D\x8E\xE4\xBB\x94
3 row(s) in 2.2770 seconds


# Method 2:
hbase(main):004:0> create 'mysql2hbase11','info'
[root@qianfeng01 sqoop-1.4.7]# sqoop import  --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hbase-table mysql2hbase11 \
--delete-target-dir \
--column-family info \
--hbase-create-table \
--hbase-row-key empno \
-m 1 \
--hbase-bulkload 

After the job runs, the result appears near the end of the output (Trying to load hfile):
20/04/03 10:41:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://qianfeng01:8020/user/root/user_info/_SUCCESS
20/04/03 10:41:12 INFO hfile.CacheConfig: CacheConfig:disabled
20/04/03 10:41:12 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://qianfeng01:8020/user/root/emp/info/1aef7d02d1a646008f18d49cbb23f20f first=1 last=3


Note:
--hbase-bulkload does not need a target path; it exports to a default directory and, when finished, loads the data into the HBase table by itself.
-m must appear before --hbase-bulkload.

# Test:
hbase(main):004:0> scan 'mysql2hbase11'
ROW                                      COLUMN+CELL
 1                                       column=info:hobby, timestamp=1585881667767, value=1
 1                                       column=info:profile, timestamp=1585881667767, value=\xE6\xBC\x94\xE5\x91\x98
 1                                       column=info:uname, timestamp=1585881667767, value=bingbing
 2                                       column=info:hobby, timestamp=1585881667767, value=2
 2                                       column=info:profile, timestamp=1585881667767, value=\xE6\xBC\x94\xE5\x91\x98
 2                                       column=info:uname, timestamp=1585881667767, value=feifei
 3                                       column=info:hobby, timestamp=1585881667767, value=1
 3                                       column=info:profile, timestamp=1585881667767, value=\xE5\x94\xB1\xE6\xAD\x8C
 3                                       column=info:uname, timestamp=1585881667767, value=\xE5\x8D\x8E\xE4\xBB\x94
3 row(s) in 0.6170 seconds


Import data incrementally

When to use it

  1. For tables that are written to frequently and keep generating new data, incremental import is recommended.
  2. For tables with a very large number of rows but few changes, incremental import is also recommended.

Two approaches

  1. query + where: can pin down the data range precisely

  2. incremental: incremental mode, driven by the last recorded value

The query + where approach

Import by querying for a specific date

Create a new script file

The table in MySQL:
 CREATE TABLE qfdb.sales_order(
    orderid INT PRIMARY KEY,
    order_date DATE
    );
[root@qianfeng01 sqoop-1.4.7]# vi ./import.sh

Write the following:

#!/bin/bash
# yesterday=`date -d "1 days ago" "+%Y-%m-%d"`
yesterday=$1
sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--query "select * from sales_order where DATE(order_date) = '${yesterday}' and \$CONDITIONS" \
--delete-target-dir \
--target-dir /user/hive/warehouse/sales_order/dt=${yesterday} \
-m 1 \
--fields-terminated-by '\t' 

Execute:

[root@qianfeng01 sqoop-1.4.7]# bash import.sh 2019-02-01

The result can be checked quickly in HDFS:

[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -cat /user/hive/warehouse/sales_order/dt=2019-02-01/pa*
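
To run this every day without passing the date by hand, enable the commented-out yesterday= line in the script (and drop the yesterday=$1 line), then schedule it with cron; a sketch, with the script path being an assumption for this setup:

# Import the previous day's orders at 01:00 every day
0 1 * * * /bin/bash /root/sqoop-1.4.7/import.sh >> /var/log/sqoop_import.log 2>&1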

The incremental append method

# last-value is maintained manually here
[root@qianfeng01 sqoop-1.4.7]# sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table sales_order \
--driver com.mysql.jdbc.Driver \
--target-dir /user/hive/warehouse/sales_order1/dt=2019-12-30 \
--split-by orderid \
-m 1 \
--check-column orderid \
--incremental append \
--last-value 800 \
--fields-terminated-by '\t'
Note: --last-value 800 means the check starts after 800; if there are newer rows they are imported incrementally, otherwise the following message is printed:
21/12/12 01:52:16 INFO tool.ImportTool: Incremental import based on column order_date
21/12/12 01:52:16 INFO tool.ImportTool: No new rows detected since last import.
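
The --last-value above has to be tracked by hand between runs. As an alternative, a saved Sqoop job records the last value in the Sqoop metastore automatically; a minimal sketch (the job name is arbitrary, and a real setup would typically use a password file instead of --password):

[root@qianfeng01 sqoop-1.4.7]# sqoop job --create sales_order_append -- import \
--connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root --password 123456 \
--table sales_order \
--target-dir /user/hive/warehouse/sales_order1 \
--check-column orderid \
--incremental append \
--last-value 0 \
-m 1
# Each run picks up where the previous one stopped
[root@qianfeng01 sqoop-1.4.7]# sqoop job --exec sales_order_append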

Use the following command to view:

[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -cat /user/hive/warehouse/sales_order1/dt=2019-12-30/pa*

Importing with null-value substitution

[root@qianfeng01 ~]# sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--delete-target-dir \
--target-dir hdfs://qianfeng01:9820/sqoopdata/emp \
--null-string '\\N' \
--null-non-string '0'

Key parameters

--null-string '\\N'    # for string columns whose value is NULL, write \N
--null-non-string '0'  # for non-string columns whose value is NULL, write 0
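
The \N marker matters because Hive's default text format reads \N back as NULL, so the values survive a later mapping of this directory to a Hive table. A quick look at the imported files (the port and path follow the command above):

[root@qianfeng01 ~]# hdfs dfs -cat hdfs://qianfeng01:9820/sqoopdata/emp/part-m-* | head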

