Sqoop in Action: Incremental Import into HDFS

 1. Command overview

 

1. import (ImportTool): import data from a relational database (a table or a query) into HDFS
2. export (ExportTool): export data from HDFS into a relational database
3. codegen (CodeGenTool): generate Java classes for a database table and package them into a jar
4. create-hive-table (CreateHiveTableTool): create a Hive table based on a database table definition
5. eval (EvalSqlTool): run a SQL statement and view its results
6. import-all-tables (ImportAllTablesTool): import all tables of a database into HDFS
7. job (JobTool): work with saved Sqoop jobs
8. list-databases (ListDatabasesTool): list all databases on a server
9. list-tables (ListTablesTool): list all tables in a database
10. merge (MergeTool): merge the results of an incremental import with an existing dataset
11. metastore (MetastoreTool): run a standalone Sqoop metastore
12. help (HelpTool): view help information
13. version (VersionTool): view version information
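
For example, the list-databases and eval tools are handy for verifying the connection before running any import. A minimal sketch, assuming a local MySQL instance and the sqoop_datas database used in the samples below; host, user and password are placeholders:

# list all databases visible to the given user
sqoop list-databases \
  --connect jdbc:mysql://localhost:3306/ \
  --username root --password root

# run an ad-hoc SQL statement and print its result (assumes a customer table exists)
sqoop eval \
  --connect jdbc:mysql://localhost:3306/sqoop_datas \
  --username root --password root \
  --query "SELECT COUNT(*) FROM customer"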

 

 2. Common arguments: general parameters, mainly for the relational database connection

 

1. --connect: JDBC URL of the relational database to connect to, e.g. jdbc:mysql://localhost/sqoop_datas
2. --connection-manager: connection manager class; rarely needed
3. --driver: JDBC driver class for the connection
4. --hadoop-home: Hadoop installation directory, e.g. /home/hadoop
5. --help: print help information
6. --password: password for the relational database connection
7. --username: username for the relational database connection
8. --verbose: print more information while working (effectively lowers the log level); this flag takes no value
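
A minimal import that uses only these general arguments might look like the following sketch (the URL, credentials and table name are placeholders); with no --target-dir given, the files land under the user's HDFS home directory, one part file per map task:

sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root --password root \
  --table customer \
  --verbose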

Import arguments (the general arguments accepted by the import tool):

--connect <jdbc-uri>: JDBC connection string
--connection-manager <class-name>: connection manager class
--driver <class-name>: manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>: override $HADOOP_MAPRED_HOME
--help: print usage instructions
--password-file: path to a file containing the authentication password
-P: pause the import and prompt for the password on the console
--password <password>: write the password directly on the command line
--username <username>: set the authentication username
--verbose: print more information while the Sqoop job runs
--connection-param-file <filename>: optional properties file providing extra connection parameters
--relaxed-isolation: set the connection transaction isolation to read uncommitted for the mappers
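
Because --password exposes the password in the shell history and process list, -P or --password-file is usually preferred. A sketch, assuming the password has been stored in a file readable only by the Sqoop user (the path is a placeholder):

# prompt for the password interactively
sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root -P \
  --table customer

# or read it from a protected file (local or HDFS path)
sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root --password-file /user/sqoop/mysql.password \
  --table customer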

 

3. Import control arguments

--append: append the imported data to an existing dataset in HDFS
--as-avrodatafile: import data as Avro data files
--as-sequencefile: import data as SequenceFiles
--as-textfile: import data as plain text files
--boundary-query <statement>: boundary query to use instead of min(split-by) and max(split-by) when creating splits
--columns <col,col,col…>: columns to import
--delete-target-dir: delete the import target directory if it already exists
--direct: use the direct (fast-path) import mode
--fetch-size <n>: number of records to read from the database in one batch
--inline-lob-limit <n>: maximum size of an inline large object (LOB)
-m, --num-mappers <n>: number of map tasks used to import data in parallel (default 4)
-e, --query <statement>: import the result set of a SQL query
--split-by <column-name>: column used to split the work units, usually combined with -m
--table <table-name>: table to read from the database
--target-dir <dir>: target HDFS directory
--warehouse-dir <dir>: HDFS parent directory under which the table's target directory is created
--where <where clause>: WHERE clause applied during the import
-z, --compress: enable compression
--compression-codec <c>: Hadoop compression codec to use (gzip by default)
--null-string <null-string>: string written for a null value in a string column (default "null")
--null-non-string <null-string>: string written for a null value in a non-string column (default "null")
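
The free-form query import below combines several of these arguments. When --query is used, the statement must contain the literal token $CONDITIONS, either --split-by or -m 1 must be supplied, and --target-dir is mandatory. A sketch with placeholder table, column and path names:

sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root --password root \
  --query 'SELECT CustomerID, CusCode, TrueName FROM customer WHERE $CONDITIONS' \
  --split-by CustomerID \
  --num-mappers 4 \
  --target-dir /user/customer_query \
  --delete-target-dir \
  --null-string '\\N' --null-non-string '\\N' \
  -z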

4. Incremental import

  Sqoop supports two incremental import modes. The first is append, which relies on a monotonically increasing column, for example:
     --incremental append --check-column id --last-value 0
  The second is lastmodified, which relies on a timestamp column, for example:
     --incremental lastmodified --check-column time --last-value '2013-01-01 11:00:00'
  which imports only the rows whose time value is greater than '2013-01-01 11:00:00'. Both modes are sketched below.
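
Sketches of both modes (column names, credentials and paths are placeholders):

# append mode: only rows with id greater than the stored last-value are imported
sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root --password root \
  --table customer --target-dir /user/customer \
  --incremental append --check-column id --last-value 0

# lastmodified mode: rows modified after the timestamp are imported;
# if the target directory already exists, --merge-key (or --append) must also be given
sqoop import \
  --connect jdbc:mysql://localhost/sqoop_datas \
  --username root --password root \
  --table customer --target-dir /user/customer \
  --incremental lastmodified --check-column time --last-value '2013-01-01 11:00:00' \
  --merge-key id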

5. Complete example

# Append-mode incremental import from SQL Server: only rows whose LogDate is greater than --last-value are imported
/opt/softwares/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect "jdbc:sqlserver://10.10.0.3\\sql2008;database=LuxeDc" --username bgdbo --password bgdbo123 \
--table Customer --target-dir /user/Customer \
--columns "CustomerID,CusCode,TrueName,LogDate" \
--fields-terminated-by "\t" \
--check-column "LogDate" \
--incremental "append" \
--last-value "2018-4-24 00:00:00"
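
Running this as a one-off command means --last-value has to be tracked and updated by hand between runs. Saving the same import as a Sqoop job lets the metastore record the last imported LogDate and advance it automatically on each execution. A sketch, where the job name customer_inc is arbitrary:

/opt/softwares/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop job --create customer_inc -- import \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect "jdbc:sqlserver://10.10.0.3\\sql2008;database=LuxeDc" --username bgdbo --password bgdbo123 \
--table Customer --target-dir /user/Customer \
--columns "CustomerID,CusCode,TrueName,LogDate" \
--fields-terminated-by "\t" \
--check-column "LogDate" --incremental append --last-value "2018-4-24 00:00:00"

# each execution continues from the last value recorded by the previous run
/opt/softwares/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop job --exec customer_inc

Note that by default --exec prompts for the database password again; storing it in the metastore requires sqoop.metastore.client.record.password=true in sqoop-site.xml.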
