Sqoop from 0 to 1: jobs and optimization for landing big data

Sqoop jobs and optimization

Job operation

Benefits of a job:
1. Create it once; there is no need to recreate it later, just re-run the job.
2. It records the last value of an incremental import for us (a sketch follows this list).
3. Job metadata is stored by default under $HOME/.sqoop/.
4. Job metadata can also be stored in MySQL.
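As a minimal sketch of benefit 2, an incremental job could be created as follows. The check column id, the starting --last-value 0, and the target directory are assumptions for illustration; after each run Sqoop records the new last value in the job's metadata.

# Hypothetical incremental job: --check-column id and --last-value 0 are assumptions
sqoop job --create sq_inc -- import \
  --connect jdbc:mysql://qianfeng01:3306/qfdb \
  --username root \
  --password 123456 \
  --table u2 \
  --target-dir '/sqoopdata/u2_inc' \
  --incremental append \
  --check-column id \
  --last-value 0 \
  -m 1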

Sqoop provides a set of job commands for operating on saved jobs.

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Instructions:

usage: sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]

Job management arguments:
   --create <job-id>            Create a new saved job
   --delete <job-id>            Delete a saved job
   --exec <job-id>              Run a saved job
   --help                       Print usage instructions
   --list                       List saved jobs
   --meta-connect <jdbc-uri>    Specify JDBC connect string for the
                                metastore
   --show <job-id>              Show the parameters for a saved job
   --verbose                    Print more information while working


List Sqoop jobs:

[root@qianfeng01 sqoop-1.4.7] sqoop job --list

Create a Sqoop Job:

[root@qianfeng01 sqoop-1.4.7]# sqoop job --create sq2 -- import  --connect jdbc:mysql://qianfeng01:3306/qfdb \
--username root \
--password 123456 \
--table u2 \
--delete-target-dir \
--target-dir '/sqoopdata/u3' \
-m 1

Note: on the first line there is a space between -- and import.

Execute a Sqoop job:

# If an error reports that the json package cannot be found, add it manually (see the problems section below)
sqoop job --exec sq1


When executing, you will be prompted for a password:
enter the database password that the job was created with.
# 1. Configure the client to remember the password: append the following to sqoop-site.xml
 <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>

# 2. Put the password in a file on HDFS and point the job at that password file
Note: when creating the job, use the --password-file parameter rather than --password. The main reason is that --password produces a warning when the job runs, and the password must be typed in before the job can execute. With --password-file, the job runs without prompting for the database password.
[root@qianfeng01 sqoop-1.4.7]# echo -n "123456" > sqoop.pwd
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -mkdir /input
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -put sqoop.pwd /input/sqoop.pwd
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -chmod 400 /input/sqoop.pwd
[root@qianfeng01 sqoop-1.4.7]# hdfs dfs -ls /input
-r-------- 1 hadoop supergroup 6 2018-01-15 18:38 /input/sqoop.pwd

# 3. Recreate the job
sqoop job --create u2 -- import \
  --connect jdbc:mysql://qianfeng01:3306/qfdb \
  --username root \
  --table u2 \
  --delete-target-dir \
  --target-dir '/sqoopdata/u3' \
  -m 1 \
  --password-file '/input/sqoop.pwd'
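With the password file in place, running the recreated job should no longer prompt for the database password. A quick check:

sqoop job --exec u2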

View a Sqoop job:

[root@qianfeng01 sqoop-1.4.7] sqoop job --show sq1

Delete a Sqoop job:

[root@qianfeng01 sqoop-1.4.7] sqoop job --delete sq1

Common problems:

1. Error when creating a job: 19/12/02 23:29:17 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
        at org.json.JSONObject.<init>(JSONObject.java:144)

Solution:
Add the java-json.jar package to Sqoop's lib directory.
If that does not solve it, check whether the hcatalog jar version is too high; if it is, remove the hcatalog jar from Sqoop's lib directory.

2. Error: Caused by: java.lang.ClassNotFoundException: org.json.JSONObject
Solution:
Add the java-json.jar package to Sqoop's lib directory (a sketch follows).
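A minimal sketch of the two fixes above, assuming SQOOP_HOME points at the installation and java-json.jar has already been downloaded to the current directory (both assumptions):

# Assumption: java-json.jar is in the current directory and SQOOP_HOME is set
cp java-json.jar $SQOOP_HOME/lib/
# If a too-new hcatalog jar triggers the NullPointerException, move it out of lib
mv $SQOOP_HOME/lib/*hcatalog*.jar /tmp/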

Sqoop optimization

Optimization of -m and split-by

  1. When the data volume is small (around 200 MB), it is best to use a single map task: it is fast and avoids producing small files.
  2. When the data volume is large, pay special attention to the characteristics of the data. The ideal split-by column is an evenly distributed number (such as an auto-increment column) or a time field, preferably indexed (int or tinyint columns work best). That way each concurrent SQL statement extracts a similar amount of data, and the extra WHERE condition Sqoop adds can use the index.
  3. Example: split-by id with -m 2 over ids 1-100 gives the first mapper (0, 50] and the second mapper (50, 100]. When choosing m, weigh the data volume, IO, the performance of the source database, and the cluster's resources. A simple rule of thumb: at most, do not exceed the number of YARN cores allocated to this user; at minimum, data volume / m should be enough to fill a 128 MB file. If conditions permit, set a value, try it out, observe the source database load, cluster IO, and run time, and adjust accordingly (a command sketch follows this list).
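As a hedged sketch of points 2 and 3, assuming the table u2 has an indexed, evenly distributed auto-increment column id (assumptions carried over from the earlier examples), a parallel import could look like this:

# Hypothetical: split the import on an indexed integer column across 4 mappers
sqoop import \
  --connect jdbc:mysql://qianfeng01:3306/qfdb \
  --username root \
  --password 123456 \
  --table u2 \
  --delete-target-dir \
  --target-dir '/sqoopdata/u2_split' \
  --split-by id \
  -m 4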

--fetch-size n

The number of rows read from MySQL in each batch. Recommended tuning considerations:

  1. Consider the size of a single row (a table with 2 columns and one with 200 columns should not use the same --fetch-size).
  2. Consider the performance of the database.
  3. Consider the network speed.
  4. Ideally, a single fetch of --fetch-size rows can satisfy one mapper (an example follows this list).
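A hedged example of setting the batch size; the value 10000 below is an assumption to be adjusted for row width, database load, and network speed as described above:

# Hypothetical value: fetch 10000 rows from MySQL per batch
sqoop import \
  --connect jdbc:mysql://qianfeng01:3306/qfdb \
  --username root \
  --password 123456 \
  --table u2 \
  --delete-target-dir \
  --target-dir '/sqoopdata/u2_fetch' \
  --fetch-size 10000 \
  -m 1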
