Sqoop Knowledge Notes

Sqoop

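Sqoop is an open-source tool for transferring data between Hadoop and traditional relational databases (MySQL, PostgreSQL, Oracle, and so on). It can import data from a relational database into HDFS, and it can export data from HDFS back into a relational database. One of Sqoop's strengths is that it runs these transfers as Hadoop MapReduce jobs.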

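Choosing a Version

CDH 5.3.x is very stable and easy to work with; these notes use cdh-5.3.6. Dependencies and compatibility differ between releases, so all components below come from the same release:
* hadoop-2.5.0-cdh5.3.6.tar.gz
* hive-0.13.1-cdh5.3.6.tar.gz
* zookeeper-3.4.5-cdh5.3.6.tar.gz
* sqoop-1.4.5-cdh5.3.6.tar.gz
Download: http://archive.cloudera.com/cdh5/cdh/5/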

Installing Sqoop

tar -zxvf sqoop-1.4.5-cdh5.3.6.tar.gz -C /opt/cdh-5.3.6
# Edit sqoop-env.sh in Sqoop's conf directory

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

# These notes use MySQL as the example RDBMS; copy the JDBC driver jar into $SQOOP_HOME/lib
cp hive-0.13.1-cdh5.3.6/lib/mysql-connector-java-5.1.27-bin.jar  sqoop-1.4.5-cdh5.3.6/lib/

Common Sqoop Commands

sqoop-import: arguments you are likely to use with import:

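Argument	Description
--append	Append data to an existing dataset in HDFS
--as-sequencefile	Import data as SequenceFiles
--as-textfile	Import data as plain text (default)
--columns <col,col,col…>	Columns to import, comma-separated, e.g. --columns "id,name"
--delete-target-dir	Delete the import target directory if it already exists
--direct	Use direct mode, which is faster (not supported for HBase)
--fetch-size <n>	Read n rows from the database at a time
-m,--num-mappers <n>	Use n map tasks to import in parallel
-e,--query <statement>	Import the results of <statement>
--split-by <column-name>	Column used to split the work among mappers
--autoreset-to-one-mapper	Fall back to one mapper when the table has no primary key and no --split-by column is given (cannot be combined with --split-by)
--table <table-name>	Table to import
--target-dir <d>	HDFS destination dir
--warehouse-dir <d>	HDFS parent for table destination
--where <where clause>	WHERE clause to apply during import. If the clause is wrapped in double quotes, mind shell escaping (for example \$CONDITIONS in a --query). Keep the clause simple: no OR, subqueries, or joins
-z,--compress	Enable compression
--null-string <null-string>	String to write for a null value in string columns
--null-non-string <null-string>	String to write for a null value in non-string columns. Both --null-* arguments are optional; if unset, the string "null" is written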

**************************************
List all databases on the MySQL server
bin/sqoop list-databases --connect jdbc:mysql://hadoop.jianxin.com:3306 --username root --password missandlove
**************************************
    
**************************************
Create a MySQL table
CREATE TABLE my_user(
  `id` int(4) NOT NULL AUTO_INCREMENT,
  `account` varchar(255) DEFAULT NULL,
  `passwd` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

Insert data
INSERT INTO `my_user` VALUES ('1', 'admin', 'admin');
INSERT INTO `my_user` VALUES ('3', 'system', 'system');
INSERT INTO `my_user` VALUES ('5', 'lee', 'lee');
INSERT INTO `my_user` VALUES ('6', 'les', 'les');
INSERT INTO `my_user` VALUES ('7', 'jianxin', 'jianxin');


**************************************

**************************************
Import data from MySQL into HDFS
bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user

If --target-dir is not specified, the data is stored in HDFS under the home directory of the user running the import.
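
As a rough illustration of where the files end up, the sketch below prints the path such an import would use. It assumes the usual HDFS layout, in which a user's home directory is /user/<name>, and that Sqoop writes to a subdirectory named after the table. The helper name default_import_dir is made up for this example; the user here is the OS/HDFS user running bin/sqoop, not the database --username.

```shell
# Hypothetical helper: print the directory an import lands in when
# --target-dir is omitted, assuming the HDFS home directory is /user/<name>.
default_import_dir() {
  hdfs_user=$1   # the HDFS user running the import
  table=$2       # the value passed to --table
  echo "/user/$hdfs_user/$table"
}

default_import_dir root my_user
# prints: /user/root/my_user
```

The printed path can then be passed to bin/hdfs dfs -ls to look at the generated part files.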
**************************************

**************************************
Import into a specified directory
Sqoop is implemented on top of MapReduce; an import runs map tasks only.

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--delete-target-dir \
--target-dir /user/jianxin/sqoop/import_myuser \
--num-mappers 3
**************************************

**************************************
Storage format of the imported data (text file or Parquet):
--as-textfile	Imports data as plain text (default)
--as-parquetfile	Imports data to Parquet Files

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--target-dir /user/jianxin/sqoop/import_myuser_parquet \
--delete-target-dir \
--num-mappers 3 \
--fields-terminated-by ',' \
--as-parquetfile
**************************************


**************************************
Create the Hive table

drop table if exists default.hive_user_prq;
create table default.hive_user_prq(
id int,
account string,
passwd string
)
ROW FORMAT DELIMITED FIELDS TERMINATED  BY ','
STORED AS parquet;

Load the data into the table
load data inpath '/user/jianxin/sqoop/import_myuser_parquet' into table default.hive_user_prq;
**************************************


**************************************
Import only specified columns

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--delete-target-dir \
--target-dir /user/jianxin/sqoop/import_column \
--as-parquetfile \
--num-mappers 3 \
--fields-terminated-by ',' \
--columns id,account

**************************************

**************************************
Preliminary cleaning and filtering with a query

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--delete-target-dir \
--target-dir /user/jianxin/sqoop/import_query \
--as-parquetfile \
--num-mappers 3 \
--split-by id \
--fields-terminated-by ',' \
--query 'select id, account from my_user where $CONDITIONS'
A --query import must include $CONDITIONS in its WHERE clause; without it Sqoop reports an error. --query cannot be combined with --table, and with more than one mapper it also needs --split-by.
create table query_table(id int ,account string) ROW FORMAT DELIMITED FIELDS TERMINATED  BY ',' STORED AS parquet;
load data inpath '/user/jianxin/sqoop/import_query' into table query_table;
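
To make the role of $CONDITIONS more concrete: each mapper runs the same query with the placeholder replaced by a range predicate on the split column. The sketch below only imitates that idea with an even integer split; the function name split_conditions is invented here, and Sqoop's actual splitter differs in detail.

```shell
# Simplified illustration only: print one range predicate per mapper,
# the kind of text that could stand in for $CONDITIONS.
split_conditions() {
  col=$1; min=$2; max=$3; mappers=$4
  span=$(( max - min + 1 ))
  i=0
  while [ "$i" -lt "$mappers" ]; do
    lo=$(( min + span * i / mappers ))
    hi=$(( min + span * (i + 1) / mappers ))
    if [ "$i" -eq $(( mappers - 1 )) ]; then
      # The last range is closed so the maximum value is included.
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    i=$(( i + 1 ))
  done
}

# ids 1..7 spread over 3 mappers
split_conditions id 1 7 3
# prints:
# id >= 1 AND id < 3
# id >= 3 AND id < 5
# id >= 5 AND id <= 7
```

Each printed line corresponds to one mapper and, in this model, to one output part file.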

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--delete-target-dir \
--target-dir /user/jianxin/sqoop/import_query \
--as-parquetfile \
--num-mappers 3 \
--split-by id \
--fields-terminated-by ',' \
--query 'select id, account from my_user where id > 1 and $CONDITIONS'
**************************************

**************************************
Compressing the imported data

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com/test \
--username root \
--password missandlove \
--table my_user  \
--target-dir  /user/jianxin/sqoop/imp_my_sannpy \
--delete-target-dir \
--num-mappers 3 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--fields-terminated-by ',' \
--as-parquetfile
**************************************

**************************************
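Incremental import
Incremental import arguments:
--check-column <column>        Source column to check for incremental change
--incremental <import-type>    Define an incremental import of type 'append' or 'lastmodified'
--last-value <value>         Last imported value in the incremental check column
--check-column names the column Sqoop examines to decide which rows count as incremental data. Relational tables usually have a suitable column, such as a Last_Mod_Date field or the primary key. Note: the check column must not be a character type such as CHAR or VARCHAR.
--incremental selects the incremental mode: append or lastmodified.
--last-value gives the largest check-column value seen by the previous import.

1. append: use this when new rows arrive with increasing ID values; the new rows are appended to the existing data. If --last-value is omitted, every row of the source table is imported again, which duplicates data.
2. lastmodified: use this when existing rows in the source table get updated. The check column must be a timestamp or date column. After the run, the --last-value for the next run is the system time at which this import executed. When the target directory already exists, a lastmodified import also needs either --merge-key or --append.
A lastmodified import takes rows whose check column is >= last-value. For example, --incremental lastmodified --check-column created --last-value '2012-02-01 11:0:00'
imports only rows whose created value is '2012-02-01 11:0:00' or later.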
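
The append rule can be pictured as a plain filter on the check column. The sketch below is a conceptual model only, not how Sqoop is implemented: it assumes comma-delimited rows whose first field is the id, and the function name rows_after is made up for this example.

```shell
# Conceptual sketch: keep only rows whose id (field 1) is greater than
# the given last value, as an append-mode run would.
rows_after() {
  last_value=$1
  awk -F',' -v last="$last_value" '$1 + 0 > last + 0'
}

printf '%s\n' '7,jianxin,jianxin' '8,les,les' '9,jianxin,jianxin' | rows_after 7
# prints:
# 8,les,les
# 9,jianxin,jianxin
```

Calling rows_after 0 on the same input keeps every row, which mirrors a first run with --last-value 0.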


bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--target-dir /user/jianxin/sqoop/increament \
--num-mappers 1 \
--incremental append \
--check-column id \
--last-value 0

[root@hadoop hadoop-2.5.0-cdh5.3.6]# bin/hdfs dfs -cat /user/jianxin/sqoop/increament/part-m-00000
1,admin,admin
3,system,system
5,lee,lee
6,les,les
7,jianxin,jianxin


INSERT INTO `my_user` VALUES ('12', 'admin', 'admin');
INSERT INTO `my_user` VALUES ('11', 'system', 'system');
INSERT INTO `my_user` VALUES ('10', 'lee', 'lee');
INSERT INTO `my_user` VALUES ('8', 'les', 'les');
INSERT INTO `my_user` VALUES ('9', 'jianxin', 'jianxin');
bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--target-dir /user/jianxin/sqoop/increament \
--num-mappers 1 \
--incremental append \
--check-column id \
--last-value 7
[root@hadoop hadoop-2.5.0-cdh5.3.6]# bin/hdfs dfs -cat /user/jianxin/sqoop/increament/part-m-00001
8,les,les
9,jianxin,jianxin
10,lee,lee
11,system,system
12,admin,admin


create table customer(id int,name varchar(20),last_mod timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);
insert into customer(id,name) values(1,'neil');
insert into customer(id,name) values(2,'jack');
insert into customer(id,name) values(3,'martin');
insert into customer(id,name) values(4,'tony');
insert into customer(id,name) values(5,'eric');

bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com/test \
--username root \
--password missandlove \
--table customer \
--target-dir /user/jianxin/sqoop/increament_bytime \
--num-mappers 1 \
--check-column last_mod \
--incremental lastmodified  \
--last-value  0

[root@hadoop hadoop-2.5.0-cdh5.3.6]# bin/hdfs dfs -cat /user/jianxin/sqoop/increament_bytime/part-m-00000
1,neil,2018-04-26 14:00:08.0
2,jack,2018-04-26 14:00:08.0
3,martin,2018-04-26 14:00:08.0
4,tony,2018-04-26 14:00:08.0
5,eric,2018-04-26 14:00:09.0

In lastmodified mode you must state how the incremental data is added: with --append (append) or with --merge-key (merge).
insert into customer(id,name) values(6,'james');
bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com/test \
--username root \
--password missandlove \
--table customer \
--target-dir /user/jianxin/sqoop/increament_bytime \
--num-mappers 1 \
--check-column last_mod \
--incremental lastmodified  \
--last-value  '2018-04-26 14:00:09' \
--merge-key id 
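With --merge-key, no additional file is created under the HDFS target directory; the new rows are merged into the existing data.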
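
The effect of merging on a key can be sketched as "for each id, the newer record wins". The snippet below models that with awk on two small text files. It is a conceptual illustration rather than Sqoop's implementation (Sqoop performs the merge as a MapReduce job); the function name merge_by_key and the sample rows for ids 2 and 6 are invented for this example.

```shell
# Conceptual sketch of a merge on the first field (id): when the same id
# appears in both files, the row from the newer file replaces the older one.
merge_by_key() {
  # $1 = older dataset, $2 = newer dataset (comma-delimited text files)
  awk -F',' '{ latest[$1] = $0 } END { for (k in latest) print latest[k] }' "$1" "$2" |
    sort -t',' -k1,1n
}

old=$(mktemp); new=$(mktemp)
printf '%s\n' '1,neil,2018-04-26 14:00:08.0' '2,jack,2018-04-26 14:00:08.0' > "$old"
# Made-up sample data: id 2 was updated and id 6 was added.
printf '%s\n' '2,jackson,2018-04-27 09:00:00.0' '6,james,2018-04-27 09:00:01.0' > "$new"
merge_by_key "$old" "$new"
# prints:
# 1,neil,2018-04-26 14:00:08.0
# 2,jackson,2018-04-27 09:00:00.0
# 6,james,2018-04-27 09:00:01.0
rm -f "$old" "$new"
```

With --append instead, the row for id 2 would appear twice, once in its old form and once in its new form.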


**************************************
Direct-mode import
bin/sqoop import \
--connect jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table customer \
--num-mappers 1 \
--target-dir /user/jianxin/sqoop/derict \
--delete-target-dir \
--direct
	
**************************************

**************************************
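Export to a database

Argument	Description
--connect <jdbc-uri>	Specify JDBC connect string
--connection-manager <class-name>	Specify connection manager class to use
--driver <class-name>	Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>	Override $HADOOP_MAPRED_HOME
--help	Print usage instructions
--password-file	Set path for a file containing the authentication password
-P	Read password from console
--password <password>	Set authentication password
--username <username>	Set authentication username
--verbose	Print more information while working
--connection-param-file <filename>	Optional properties file that provides connection parameters
--relaxed-isolation	Set connection transaction isolation to read uncommitted for the mappers
--columns <col,col,col…>	Columns to export to table
--direct	Use direct export fast path
--export-dir <dir>	HDFS source path for the export
-m,--num-mappers <n>	Use n map tasks to export in parallel
--table <table-name>	Table to populate
--call <stored-proc-name>	Stored procedure to call
--update-key <col-name>	Anchor column to use for updates. Use a comma-separated list of columns if there is more than one
--update-mode <mode>	Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode are updateonly (default) and allowinsert
--input-null-string <null-string>	The string to be interpreted as null for string columns
--input-null-non-string <null-string>	The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>	The table in which data will be staged before being inserted into the destination table
--clear-staging-table	Indicates that any data present in the staging table can be deleted
--batch	Use batch mode for underlying statement execution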

touch /opt/datas/my_user_fromhive
[root@hadoop datas]# vi my_user_fromhive 
16,hiveuser,hiveuser
18,hivemiss,hivemiss
[root@hadoop hadoop-2.5.0-cdh5.3.6]# bin/hdfs dfs -mkdir -p /user/jianxin/hive_to_mysql
[root@hadoop hadoop-2.5.0-cdh5.3.6]# bin/hdfs dfs -put /opt/datas/my_user_fromhive  /user/jianxin/hive_to_mysql


bin/sqoop export \
--connect  jdbc:mysql://hadoop.jianxin.com:3306/test \
--username root \
--password missandlove \
--table my_user \
--export-dir /user/jianxin/hive_to_mysql \
--num-mappers 1
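
Conceptually, an export reads each delimited line from the export directory and turns it into a row insert on the target table. The sketch below imitates that mapping for comma-delimited input. It is an illustration under simplifying assumptions: the function name lines_to_inserts is made up, values are not escaped, and Sqoop itself issues its statements through JDBC rather than printing SQL text.

```shell
# Conceptual sketch: turn comma-delimited lines into INSERT statements.
# Values are quoted naively; embedded quotes are not handled.
lines_to_inserts() {
  table=$1
  awk -F',' -v t="$table" '{
    vals = ""
    for (i = 1; i <= NF; i++) vals = vals (i > 1 ? ", " : "") "\047" $i "\047"
    print "INSERT INTO " t " VALUES (" vals ");"
  }'
}

printf '%s\n' '16,hiveuser,hiveuser' | lines_to_inserts my_user
# prints: INSERT INTO my_user VALUES ('16', 'hiveuser', 'hiveuser');
```

The field order in each line has to match the column order of the target table, or the order given with --columns.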
**************************************
