Sqoop import practice
Sqoop-import
Case 1
The table has no primary key, so the number of map tasks must be set to 1 for the import to run.
Sqoop import principle:
By default, Sqoop imports data from the source database in parallel. You can use the -m or --num-mappers argument to specify the number of map tasks (parallel processes) used to perform the import. The argument takes an integer value corresponding to the degree of parallelism; four tasks are used by default. Some databases see improved performance when this value is raised to 8 or 16.
By default, Sqoop uses the table's primary key column as the split column. It retrieves the low and high values of the split column from the database, and each map task operates on an evenly sized slice of that range. For example, if the range of id is 0-800, Sqoop (with the default of 4 tasks) first runs
SELECT MIN(id), MAX(id) FROM emp
to find the bounds, and then assigns the four tasks the id ranges (0-200), (200-400), (400-600), and (600-800). When a table has no primary key, this splitting cannot be performed, and the import fails with an error. In that case, set the number of mappers to 1 with -m so that only a single mapper runs; no splitting is needed, and the missing-primary-key error is avoided.
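The default split computation described above can be sketched in shell (the MIN/MAX bounds 0 and 800 match the example; the exact arithmetic is an illustration, not Sqoop's source code):

```shell
# Sketch of Sqoop's default split computation for -m 4, assuming the
# bounds returned by SELECT MIN(id), MAX(id) are 0 and 800.
min=0
max=800
mappers=4
step=$(( (max - min) / mappers ))
for (( i = 0; i < mappers; i++ )); do
  lo=$(( min + i * step ))
  hi=$(( lo + step ))
  echo "mapper $i: id >= $lo AND id < $hi"
done
```

Each mapper then runs its own bounded SELECT, so the ranges together cover the whole table exactly once.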
# Error message
ERROR tool.ImportTool: Import failed: No primary key could be found for table emp. Please specify one with --split-by or perform a sequential import with '-m 1'.
Import code:
[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp -m 1
DBMS-HDFS
Case 2
The table has no primary key; use --split-by to specify the split column
The problem is the same as above. When the table has no primary key, another option is to manually specify the column to split on with --split-by:
[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--split-by empno \
--delete-target-dir \
--target-dir hdfs://qianfeng01:8020/sqoopdata/emp
-- Error
Caused by: java.sql.SQLException: null, message from server: "Host 'qianfeng01' is not allowed to connect to this MySQL server"
Solution:
First, connect to MySQL:
[root@qianfeng01 sqoop-1.4.7]# mysql -uroot -p
(Execute the following statement. *.*: all tables in all databases; %: any IP address or host may connect)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
FLUSH PRIVILEGES;
Case 3: Conditional import (incremental import)
Not all of the data needs to be imported; import only the rows that match a condition:
[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--split-by empno \
--where 'empno > 7777' \
--target-dir hdfs://qianfeng01:8020/sqoopdata/emp
Case 4: Partial field import
Only some of the fields are needed, not all of them.
Note: this is similar to --where, but more flexible in use.
[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--split-by empno \
--query 'select empno,ename,job from emp where empno > 7777 and $CONDITIONS' \
--target-dir hdfs://qianfeng01:8020/sqoopdata/7
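The literal $CONDITIONS in the --query string is a placeholder: each mapper substitutes its own split predicate for it at run time. A minimal sketch of that substitution (the bounds 7800/7900 are invented for illustration):

```shell
# The query as passed to Sqoop; single quotes keep $CONDITIONS literal.
query='select empno,ename,job from emp where empno > 7777 and $CONDITIONS'
# Hypothetical bounds one mapper might receive from the empno split:
predicate="empno >= 7800 AND empno < 7900"
split_query=${query/\$CONDITIONS/$predicate}
echo "$split_query"
```

This is why $CONDITIONS is mandatory with --query: without it, Sqoop has nowhere to inject the per-mapper bounds.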
DBMS-Hive
Case 5: Import data into Hive
[root@qianfeng01 sqoop-1.4.7]# bin/sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hive-import \
-m 1
DBMS-HBase
Case 6: Import data into HBase
Create the table in HBase:
create 'mysql2hbase','info'
# Method 1:
[root@qianfeng01 sqoop-1.4.7]# sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hbase-table mysql2hbase \
--column-family info \
--hbase-create-table \
--hbase-row-key empno \
-m 1
Note: if you are using HBase 2.x or later, you need to add the HBase 1.x dependencies (e.g. from the 1.6 release); otherwise the following error occurs:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.HBaseAdmin.<init>(Lorg/apache/hadoop/conf/Configuration;)V
Download the release from: https://archive.apache.org/dist/hbase/1.6.0/
Procedure: copy all the jars from the HBase 1.6 lib directory into Sqoop's lib directory, then run the command above again.
Test:
hbase(main):008:0> scan 'mysql2hbase'
ROW COLUMN+CELL
1 column=info:hobby, timestamp=1585852383291, value=1
1 column=info:profile, timestamp=1585852383291, value=\xE6\xBC\x94\xE5\x91\x98
1 column=info:uname, timestamp=1585852383291, value=bingbing
2 column=info:hobby, timestamp=1585852383291, value=2
2 column=info:profile, timestamp=1585852383291, value=\xE6\xBC\x94\xE5\x91\x98
2 column=info:uname, timestamp=1585852383291, value=feifei
3 column=info:hobby, timestamp=1585852383291, value=1
3 column=info:profile, timestamp=1585852383291, value=\xE5\x94\xB1\xE6\xAD\x8C
3 column=info:uname, timestamp=1585852383291, value=\xE5\x8D\x8E\xE4\xBB\x94
3 row(s) in 2.2770 seconds
# Method 2:
hbase(main):004:0> create 'mysql2hbase11','info'
[root@qianfeng01 sqoop-1.4.7]# sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table emp \
--hbase-table mysql2hbase11 \
--delete-target-dir \
--column-family info \
--hbase-create-table \
--hbase-row-key empno \
-m 1 \
--hbase-bulkload
After the job runs, the end of the log shows the result (Trying to load hfile):
20/04/03 10:41:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://qianfeng01:8020/user/root/user_info/_SUCCESS
20/04/03 10:41:12 INFO hfile.CacheConfig: CacheConfig:disabled
20/04/03 10:41:12 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://qianfeng01:8020/user/root/emp/info/1aef7d02d1a646008f18d49cbb23f20f first=1 last=3
Note:
--hbase-bulkload takes no output path; it writes HFiles to a default directory and, when the job finishes, loads them into the HBase table itself.
-m must appear before --hbase-bulkload.
# Test:
hbase(main):004:0> scan 'mysql2hbase11'
ROW COLUMN+CELL
1 column=info:hobby, timestamp=1585881667767, value=1
1 column=info:profile, timestamp=1585881667767, value=\xE6\xBC\x94\xE5\x91\x98
1 column=info:uname, timestamp=1585881667767, value=bingbing
2 column=info:hobby, timestamp=1585881667767, value=2
2 column=info:profile, timestamp=1585881667767, value=\xE6\xBC\x94\xE5\x91\x98
2 column=info:uname, timestamp=1585881667767, value=feifei
3 column=info:hobby, timestamp=1585881667767, value=1
3 column=info:profile, timestamp=1585881667767, value=\xE5\x94\xB1\xE6\xAD\x8C
3 column=info:uname, timestamp=1585881667767, value=\xE5\x8D\x8E\xE4\xBB\x94
3 row(s) in 0.6170 seconds
Import data incrementally
Use cases
- For tables that are operated on frequently and continuously generate new data, incremental import is recommended.
- When a table is very large but changes little, incremental import is also recommended.
Two approaches
- query + where: can precisely lock the data range to import
- incremental: imports based on the last recorded value
The query + where method
Import rows for a specific date.
Create a new script file:
The table in MySQL:
CREATE TABLE qfdb.sales_order(
orderid INT PRIMARY KEY,
order_date DATE
);
[root@qianfeng01 sqoop-1.4.7]# vi ./import.sh
Write the following:
#!/bin/bash
# yesterday=`date -d "1 days ago" "+%Y-%m-%d"`
yesterday=$1
sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--query "select * from sales_order where DATE(order_date) = '${yesterday}' and \$CONDITIONS" \
--delete-target-dir \
--target-dir /user/hive/warehouse/sales_order/dt=${yesterday} \
-m 1 \
--fields-terminated-by '\t'
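The commented-out date line in the script hints at the usual production variant: default to yesterday when no argument is given. A minimal sketch of that fallback (GNU date assumed; the function name is mine, not part of the tutorial):

```shell
# Resolve the partition date: first argument if present, otherwise
# yesterday (GNU date syntax, as in the commented line in the script).
resolve_date() {
  echo "${1:-$(date -d "1 days ago" "+%Y-%m-%d")}"
}

resolve_date 2019-02-01
```

With this pattern the same script serves both scheduled daily runs (no argument) and manual backfills of a specific date.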
Execute:
[root@qianfeng01 sqoop-1.4.7]# bash import.sh 2019-02-01
The result can be quickly checked in HDFS:
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -cat /user/hive/warehouse/sales_order/dt=2019-02-01/pa*
The incremental append method
# The last-value must be maintained manually
[root@qianfeng01 sqoop-1.4.7]# sqoop import --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table sales_order \
--driver com.mysql.jdbc.Driver \
--target-dir /user/hive/warehouse/sales_order1/dt=2019-12-30 \
--split-by orderid \
-m 1 \
--check-column orderid \
--incremental append \
--last-value 800 \
--fields-terminated-by '\t'
Note: --last-value 800 means checking starts after 800; if newer rows exist they are imported incrementally, otherwise the following message is printed:
21/12/12 01:52:16 INFO tool.ImportTool: Incremental import based on column orderid
21/12/12 01:52:16 INFO tool.ImportTool: No new rows detected since last import.
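What --incremental append does with --check-column and --last-value can be sketched as a simple filter: only rows whose check column exceeds the last recorded value are imported, and the maximum imported value becomes the next --last-value (the orderid values below are invented):

```shell
last_value=800
next_last=$last_value
imported=0
# Hypothetical orderid values currently in the MySQL table:
for id in 798 799 801 802; do
  if [ "$id" -gt "$last_value" ]; then
    imported=$(( imported + 1 ))
    # Track the highest value seen; it becomes the next --last-value.
    next_last=$id
  fi
done
echo "imported $imported rows, next --last-value is $next_last"
```

Sqoop prints the new value at the end of the job; with a plain sqoop import you record it yourself, whereas a saved sqoop job stores it in the metastore automatically.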
Use the following command to view:
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -cat /user/hive/warehouse/sales_order1/dt=2019-12-30/pa*
Importing data with NULL-value substitution
[root@qianfeng01 ~]# sqoop import --connect jdbc:mysql://localhost:3306/qfdb \
--username root --password 123456 \
--table emp \
--delete-target-dir \
--target-dir hdfs://qianfeng01:8020/sqoopdata/emp \
--null-string '\\N' \
--null-non-string '0'
Key parameters
--null-string '\\N'    ## write \N when a string column is NULL
--null-non-string '0'  # write 0 when a non-string column is NULL
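The effect of the two flags on a single output line can be sketched as follows (the NULL ename and comm values are invented for illustration):

```shell
null_string='\N'   # what --null-string writes for a NULL string column
null_non_string=0  # what --null-non-string writes for a NULL non-string column
ename=""           # suppose ename (a string column) is NULL in MySQL
comm=""            # suppose comm (a numeric column) is NULL in MySQL
line="${ename:-$null_string},${comm:-$null_non_string}"
echo "$line"
```

The value is written as '\\N' on the sqoop command line because Sqoop unescapes it once; \N is the token Hive reads back as NULL, which is why this pairing matters for MySQL-to-Hive pipelines.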