Big data study notes (7)

1. Sqoop

Sqoop is an open-source data migration tool that can import and export data between Hive and an RDBMS, and can likewise move data between HDFS and an RDBMS.

1.1 The working mechanism of Sqoop

1.1.1 Import mechanism

The Sqoop import operation is to import RDBMS data into HDFS.
When an import runs, Sqoop first reads the table's columns and their data types through JDBC and maps those types to Java types. The underlying MapReduce job then reads rows from the database through an InputFormat object; DataDrivenDBInputFormat splits the query results into different map tasks, these tasks are submitted to the MapReduce cluster for execution, and each map task fills the rows it reads into the corresponding generated record instances.
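To make the split behaviour concrete, here is a rough sketch (it reuses the connection details and emp table from the import examples later in this note; /sqoop/emp_split is just an illustrative path). The --split-by column is what DataDrivenDBInputFormat partitions on, and -m caps the number of map tasks:

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --split-by id -m 4 --target-dir /sqoop/emp_split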

1.1.2 Export mechanism

A Sqoop export copies data from HDFS into an RDBMS. Before exporting, Sqoop chooses an export strategy (usually JDBC) and generates a Java class that can parse the text records in HDFS and insert the corresponding values into the table. The JDBC-based strategy produces multiple INSERT statements, each of which inserts several rows into the table. To keep the different I/O paths running in parallel, reading from HDFS and communicating with the database are handled by multiple threads executing concurrently.
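For illustration, a sketch of an export that exercises this parallelism (the table and directory match the export example in section 1.4; --batch is a standard Sqoop export option that asks the JDBC driver to batch the generated INSERT statements, and -m sets the number of parallel map tasks):

bin/sqoop export --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_out --export-dir /sqoop/emp -m 4 --batch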

1.2 Sqoop installation

Download address: http://archive.apache.org/dist/sqoop/. Unpack the archive after downloading.

1.2.1 Sqoop configuration

cd /export/servers/sqoop-1.4.7/conf
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh

Add the following content to the sqoop-env.sh file:

export HADOOP_COMMON_HOME=/export/servers/hadoop-3.1.1
export HADOOP_MAPRED_HOME=/export/servers/hadoop-3.1.1
export HIVE_HOME=/export/servers/apache-hive-3.1.1-bin

1.2.2 Add dependent jar package

Sqoop needs the JDBC driver jar for your database and the java-json dependency jar. After obtaining these jars, copy them into Sqoop's lib directory.
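For example (a sketch only; the exact jar file names depend on the versions you downloaded, so treat the names below as placeholders):

cp mysql-connector-java-5.1.47.jar /export/servers/sqoop-1.4.7/lib/
cp java-json.jar /export/servers/sqoop-1.4.7/lib/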

After adding them, run the following commands to verify that Sqoop works:

cd /export/servers/sqoop-1.4.7/
bin/sqoop-version

1.3 Data import

1.3.1 Sqoop commands

  • List all databases in MySQL:
bin/sqoop list-databases --connect jdbc:mysql://192.168.31.7:3306/ --username root --password root
  • Check which tables are in a MySQL database:
bin/sqoop list-tables --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root
  • View the help document:
bin/sqoop list-databases --help
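As an aside, if you prefer not to type the password on the command line, Sqoop also accepts -P, which prompts for the password interactively. For example:

bin/sqoop list-databases --connect jdbc:mysql://192.168.31.7:3306/ --username root -P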

1.3.2 Import example

  • Table data:
create table emp(
	id int primary key auto_increment,
	name varchar(255) not null default '',
	dep varchar(20) default '',
	salary int default 0,
	dept char(2) default ''
);

create table emp_add(
	id int primary key auto_increment,
	hon varchar(20) not null default '',
	street varchar(20) default '',
	city varchar(20) default ''
);

create table emp_conn(
	id int primary key auto_increment,
	phone varchar(11) not null default '',
	email varchar(50) default ''
);

insert into emp values(1201, 'gopal', 'manager', 50000, 'TP');
insert into emp values(1202, 'manisha', 'proof reader', 50000, 'TP');
insert into emp values(1203, 'khalil', 'php dev', 30000, 'AC');
insert into emp values(1204, 'prasanth', 'php dev', 30000, 'AC');
insert into emp values(1205, 'kranthi', 'admin', 20000, 'TP');

insert into emp_add values(1201, '288A', 'vgiri', 'jublee');
insert into emp_add values(1202, '108I', 'aoc', 'sec-bad');
insert into emp_add values(1203, '144Z', 'pgutta', 'hyd');
insert into emp_add values(1204, '78B', 'old city', 'sec-bad');
insert into emp_add values(1205, '720X', 'hitec', 'sec-bad');

insert into emp_conn values(1201, '2356742', '[email protected]');
insert into emp_conn values(1202, '1661663', '[email protected]');
insert into emp_conn values(1203, '8887776', '[email protected]');
insert into emp_conn values(1204, '9988774', '[email protected]');
insert into emp_conn values(1205, '1231231', '[email protected]');

Import command:

# Import the emp table data into HDFS
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --password root --username root --table emp -m 1

After the import is successful, execute the HDFS command to view the import result:

hdfs dfs -cat /user/root/emp/part*
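With the five sample rows above and the default comma delimiter, the output should look roughly like this:

1201,gopal,manager,50000,TP
1202,manisha,proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP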

You can also use the --target-dir parameter to specify the HDFS directory to import into (here combined with --delete-target-dir to remove the directory first if it already exists). For example:

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --delete-target-dir --table emp --target-dir /sqoop/emp -m 1

By default, Sqoop separates the columns with a comma (","). To use a different delimiter, specify the --fields-terminated-by parameter. For example:

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password  root --delete-target-dir --table emp --target-dir /sqoop/emp2 -m 1 --fields-terminated-by '\t'

You can also import data into Hive. The steps are:

  • First, copy the hive-exec-3.1.1.jar package into Sqoop's lib directory;
cp /export/servers/apache-hive-3.1.1-bin/lib/hive-exec-3.1.1.jar /export/servers/sqoop-1.4.7/lib
  • Before importing, create the required database and table in Hive;
create database sqooptohive;
use sqooptohive;
create external table emp_hive(id int,name string,dep string,salary int ,dept string) row format delimited fields terminated by '\001';
  • Perform the import;
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --fields-terminated-by '\001' --hive-import --hive-table sqooptohive.emp_hive --hive-overwrite --delete-target-dir -m 1

Parameter description:
  • --hive-import: specifies that this is an import into Hive;
  • --hive-table: the target Hive table;
  • --hive-overwrite: overwrite any existing data in the Hive table;
  • -m: the number of map tasks to run concurrently.
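One way to spot-check the result of the import above, assuming the hive CLI is on the PATH:

hive -e "select * from sqooptohive.emp_hive;"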

By specifying the --hive-database parameter, you can import the MySQL data and table structure directly into Hive:

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_conn --hive-import -m 1 --hive-database sqooptohive

If you only need to import rows that meet certain conditions, you can specify the --where parameter.

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_add --target-dir /sqoop/emp_add -m 1 --delete-target-dir --where "city = 'sec-bad'"

You can also supply the SQL statement to be executed through the --query parameter (the query must include the $CONDITIONS token).

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root -m 1 --query 'select * from emp_add where city="sec-bad" and $CONDITIONS' --target-dir /sqoop/emp_add --delete-target-dir

If you repeat the import commands above, you will find that each run overwrites the previously imported data. Sqoop also supports incremental import, where later imports do not overwrite earlier data. An incremental import requires three parameters: --incremental, --check-column, and --last-value.

# Import records from the emp table with id greater than 1202
bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --incremental append --check-column id --last-value 1202 -m 1 --target-dir /sqoop/increment

More precise control can also be achieved through the --where parameter.

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --incremental append --check-column id --where "id > 1202" -m 1 --target-dir /sqoop/increment

Note: an incremental import cannot specify the --delete-target-dir parameter.
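To pick up rows added after a previous run, pass the highest value already imported as the new --last-value (Sqoop prints this value at the end of each incremental run). For example, once ids up to 1205 have been imported:

bin/sqoop import --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp --incremental append --check-column id --last-value 1205 -m 1 --target-dir /sqoop/increment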

1.4 Data export

Data export copies data from HDFS into an RDBMS. In this example, the emp data imported into /sqoop/emp earlier is exported back to MySQL.

Data export steps:

  • Step 1: Create a table in the MySQL database. Its columns must match the number and types of the fields in the HDFS data to be exported;

Note: Before performing data export, the target table must already exist.

create table emp_out(id int, name varchar(100), dep varchar(50), sal int, dept varchar(10), create_time timestamp);
  • Step 2: Perform export;
bin/sqoop export --connect jdbc:mysql://192.168.31.7:3306/azkaban --username root --password root --table emp_out --export-dir /sqoop/emp --input-fields-terminated-by ","
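A quick way to confirm the export, assuming the MySQL client is available and the host and credentials match the examples above:

mysql -h 192.168.31.7 -u root -p azkaban -e "select * from emp_out;"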


Source: blog.csdn.net/zhongliwen1981/article/details/107342672