1. Sqoop
Sqoop is an open-source data migration tool that can import and export data between Hive and an RDBMS. It can also migrate data between HDFS and an RDBMS.
1.1 The working mechanism of Sqoop
1.1.1 Import mechanism
The Sqoop import operation copies RDBMS data into HDFS.
When performing an import, Sqoop first reads the table's columns and column types through JDBC and maps those SQL types to Java types. The underlying MapReduce job then reads from the database through an InputFormat object; DataDrivenDBInputFormat divides the query results into splits, one per map task, and these tasks are submitted to the MapReduce cluster for execution. Each record read by a map task is deserialized into an instance of the generated record class.
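To make the splitting step concrete, here is a hypothetical Python sketch of how DataDrivenDBInputFormat-style range splitting might divide a numeric key column among map tasks (an illustration of the idea only, not Sqoop's actual Java implementation; the function name is invented):

```python
def make_splits(min_val, max_val, num_tasks):
    """Divide [min_val, max_val] into num_tasks contiguous ranges,
    one per map task, similar to splitting on a --split-by column."""
    size = (max_val - min_val + 1) // num_tasks
    splits = []
    lo = min_val
    for i in range(num_tasks):
        # The last task absorbs any remainder of the range.
        hi = max_val if i == num_tasks - 1 else lo + size - 1
        # Each pair becomes a query predicate: "WHERE id >= lo AND id <= hi".
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Splitting the emp id range 1201..1205 across two map tasks:
print(make_splits(1201, 1205, 2))  # [(1201, 1202), (1203, 1205)]
```

Each split is handed to one mapper, so the mappers read disjoint slices of the table in parallel.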
1.1.2 Export mechanism
The Sqoop export operation copies data from HDFS into an RDBMS. Before exporting, Sqoop chooses an export strategy (usually JDBC) and then generates a Java class that can parse the text records in HDFS and insert the corresponding values into the target table. The JDBC-based export generates multiple INSERT statements, each of which inserts several rows into the table at once. To keep the different I/O paths running in parallel, Sqoop starts multiple threads that read from HDFS and communicate with the database simultaneously.
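The batching idea behind the JDBC export path can be sketched as follows: parse delimited HDFS text lines and group several rows into one multi-row INSERT. This is a hypothetical Python illustration of the concept, not Sqoop's generated Java code:

```python
def build_inserts(lines, table, batch_size=2, sep=","):
    """Turn delimited text lines into batched multi-row INSERT statements."""
    stmts = []
    rows = [tuple(line.split(sep)) for line in lines]
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Render each row as a parenthesized value list, then join the batch.
        values = ", ".join(
            "(" + ", ".join(repr(v) for v in row) + ")" for row in batch
        )
        stmts.append(f"INSERT INTO {table} VALUES {values};")
    return stmts

lines = ["1201,gopal,manager", "1202,manisha,proof reader", "1203,khalil,php dev"]
for s in build_inserts(lines, "emp_out"):
    print(s)
```

With three input rows and a batch size of two, this emits two INSERT statements, which is why a JDBC export issues far fewer round trips than one statement per row.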
1.2 Sqoop installation
Download address: http://archive.apache.org/dist/sqoop/. Decompress the archive after downloading.
1.2.1 Sqoop configuration
cd /export/servers/sqoop-1.4.7/conf
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Add the following content to the sqoop-env.sh file:
export HADOOP_COMMON_HOME=/export/servers/hadoop-3.1.1
export HADOOP_MAPRED_HOME=/export/servers/hadoop-3.1.1
export HIVE_HOME=/export/servers/apache-hive-3.1.1-bin
1.2.2 Add dependent jar package
Sqoop needs the database's JDBC driver package and the java-json dependency package. After preparing these jar packages, copy them into Sqoop's lib directory.
After the addition is complete, execute the following command to verify whether it is successful:
cd /export/servers/sqoop-1.4.7/
bin/sqoop-version
1.3 Data import
1.3.1 sqoop command
- List all databases in MySQL:
bin/sqoop list-databases --connect jdbc:mysql://192.168.31.7:3306/ --username root --password root
- List the tables in a MySQL database:
bin/sqoop list-tables --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root
- View the help document:
bin/sqoop list-databases --help
1.3.2 Import example
- Table data:
create table emp(
id int primary key auto_increment,
name varchar(255) not null default '',
dep varchar(20) default '',
salary int default 0,
dept char(2) default ''
);
create table emp_add(
id int primary key auto_increment,
hon varchar(20) not null default '',
street varchar(20) default '',
city varchar(20) default ''
);
create table emp_conn(
id int primary key auto_increment,
phone varchar(11) not null default '',
email varchar(50) default ''
);
insert into emp values(1201, 'gopal', 'manager', 50000, 'TP');
insert into emp values(1202, 'manisha', 'proof reader', 50000, 'TP');
insert into emp values(1203, 'khalil', 'php dev', 30000, 'AC');
insert into emp values(1204, 'prasanth', 'php dev', 30000, 'AC');
insert into emp values(1205, 'kranthi', 'admin', 20000, 'TP');
insert into emp_add values(1201, '288A', 'vgiri', 'jublee');
insert into emp_add values(1202, '108I', 'aoc', 'sec-bad');
insert into emp_add values(1203, '144Z', 'pgutta', 'hyd');
insert into emp_add values(1204, '78B', 'old city', 'sec-bad');
insert into emp_add values(1205, '720X', 'hitec', 'sec-bad');
insert into emp_conn values(1201, '2356742', '[email protected]');
insert into emp_conn values(1202, '1661663', '[email protected]');
insert into emp_conn values(1203, '8887776', '[email protected]');
insert into emp_conn values(1204, '9988774', '[email protected]');
insert into emp_conn values(1205, '1231231', '[email protected]');
Import command:
# Import the emp table data into HDFS
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --password root --username root --table emp -m 1
After the import is successful, execute the HDFS command to view the import result:
hdfs dfs -cat /user/root/emp/part*
You can also specify the --target-dir parameter to choose the HDFS output directory. For example:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --delete-target-dir --table emp --target-dir /sqoop/emp -m 1
By default, Sqoop separates the columns of each record with a comma ",". To use a different separator, specify the --fields-terminated-by parameter. For example:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --delete-target-dir --table emp --target-dir /sqoop/emp2 -m 1 --fields-terminated-by '\t'
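To see what the resulting part files contain, here is a hypothetical Python sketch that reads back records written with --fields-terminated-by '\t'; the column names are taken from the emp table above:

```python
def parse_part_file(text, sep="\t"):
    """Parse delimited Sqoop output lines into dicts keyed by column name."""
    cols = ("id", "name", "dep", "salary", "dept")
    return [dict(zip(cols, line.split(sep))) for line in text.splitlines() if line]

# Two tab-delimited records, as they would appear in /sqoop/emp2/part-*:
sample = "1201\tgopal\tmanager\t50000\tTP\n1202\tmanisha\tproof reader\t50000\tTP\n"
records = parse_part_file(sample)
print(records[0]["name"])  # gopal
```

The separator chosen at import time must match whatever later consumes the files (for example, the Hive table's field delimiter below).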
You can also import data into Hive. The import steps:
- First, copy the hive-exec-3.1.1.jar package into Sqoop's lib directory:
cp /export/servers/apache-hive-3.1.1-bin/lib/hive-exec-3.1.1.jar /export/servers/sqoop-1.4.7/lib
- Before importing, create the required database and table in Hive:
create database sqooptohive;
use sqooptohive;
create external table emp_hive(id int,name string,dep string,salary int ,dept string) row format delimited fields terminated by '\001';
- Perform the import:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --fields-terminated-by '\001' --hive-import --hive-table sqooptohive.emp_hive --hive-overwrite --delete-target-dir -m 1
Parameter description:
- --hive-import: specifies that the command performs an import into Hive;
- --hive-table: the name of the target Hive table;
- --hive-overwrite: overwrite any existing data in the Hive table;
- -m: the number of map tasks to run concurrently.
By specifying the --hive-database parameter, you can import MySQL data and the table structure directly into Hive:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_conn --hive-import -m 1 --hive-database sqooptohive
If you only need to import rows that meet certain conditions, specify the --where parameter:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_add --target-dir /sqoop/emp_add -m 1 --delete-target-dir --where "city = 'sec-bad'"
You can also specify the SQL statement to be executed with the --query parameter:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root -m 1 --query 'select * from emp_add where city="sec-bad" and $CONDITIONS' --target-dir /sqoop/emp_add --delete-target-dir
If you repeat the import commands above, you will find that each subsequent import overwrites the previously imported data. Sqoop also supports incremental import, in which data imported later does not overwrite the previously imported data. An incremental import requires three parameters: --incremental, --check-column, and --last-value.
# Import the records from the emp table whose id is greater than 1202.
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --incremental append --check-column id --last-value 1202 -m 1 --target-dir /sqoop/increment
More precise control can also be achieved with the --where parameter:
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --incremental append --check-column id --where "id > 1202" -m 1 --target-dir /sqoop/increment
Note: An incremental import cannot specify the --delete-target-dir parameter.
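The append mode above can be summarized as: import only rows whose check column exceeds the stored last-value, then advance the last-value for the next run (Sqoop prints the new value to use next time). A hypothetical Python sketch of that bookkeeping:

```python
def incremental_import(rows, check_column, last_value):
    """Select only rows newer than last_value and compute the next last_value."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    # If nothing qualified, the last-value stays where it was.
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

emp = [{"id": 1201}, {"id": 1202}, {"id": 1203}, {"id": 1204}, {"id": 1205}]
imported, last = incremental_import(emp, "id", 1202)
print(len(imported), last)  # 3 1205
```

Passing 1205 as --last-value on the next run would then pick up only rows inserted after this import.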
1.4 Data export
Data export copies data from HDFS to an RDBMS. Here the emp data imported earlier into /sqoop/emp is exported back to MySQL.
Data export steps:
- Step 1: Create a table in the MySQL database. The table's columns must match the number and types of the fields in the HDFS data.
Note: The target table must already exist before the export is performed.
create table emp_out(id int, name varchar(100), dep varchar(50), sal int, dept varchar(10), create_time timestamp);
- Step 2: Perform export;
bin/sqoop export --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_out --export-dir /sqoop/emp --input-fields-terminated-by ","