Hadoop Big Data Technology (10): Sqoop Data Migration

Table of contents

1. Overview of Sqoop

1. Understanding Sqoop

2. Principle of Sqoop

(1) Import principle

(2) Export principle

2. Sqoop installation and configuration

1. Download and install

2. MySQL configuration and startup

3. Configure the Sqoop environment

4. Sqoop functionality test

3. Sqoop data import

1. Import MySQL table data into HDFS

2. Incrementally import MySQL table data into HDFS

3. Import MySQL table data into Hive

4. Import a subset of MySQL table data

4. Sqoop data export

Reference books


1. Overview of Sqoop

1. Understanding Sqoop

        Sqoop is an open source tool under Apache. The project started in 2009 as a third-party module of Hadoop. Later, to let users deploy it more quickly and developers iterate faster, it became an independent top-level Apache open source project in 2013.

        Sqoop is mainly used to transfer data between Hadoop and relational databases or mainframes. With the Sqoop tool, you can import data from a relational database management system into the Hadoop distributed file system, or transform data in Hadoop and export it to a relational database management system.

        Currently, Sqoop has two major versions, Sqoop1 and Sqoop2. Version numbers of the form 1.4.x belong to Sqoop1, while 1.99.x belongs to Sqoop2. The two versions were developed with different goals and have very different architectures, so they are not compatible with each other.

        Sqoop1 has a simple structure, is easy to deploy, and provides command-line operations; it is mainly suitable for system administrators performing simple data migration tasks. Sqoop2 has more complete functionality, is easier to operate, supports multiple access modes (command line, Web UI, and REST API), and introduces a role-based security mechanism; the trade-off is a more complex architecture and more cumbersome configuration and deployment.

2. Principle of Sqoop

        Sqoop is a tool for synchronizing data between traditional relational database servers and Hadoop. Under the hood it uses the MapReduce parallel computing model to speed up batch data transfer and to provide better fault tolerance.

        When the Sqoop tool is invoked through the client CLI (command-line interface) or the Java API, Sqoop translates the instruction into a corresponding MapReduce job. The job usually contains only Map tasks; each Map task reads a slice of the data from the database, so multiple Map tasks copy data concurrently and the whole data set can be copied to HDFS quickly. During the job, data is converted between the relational database format and the Hadoop format to complete the migration.
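
        The degree of this parallelism can be controlled on the command line. As a minimal sketch (not from the referenced book; the host name hadoop01.bgd01 and the userdb/emp table reuse the examples later in this article, and the target directory /sqoop_parallel_demo is an assumption), --num-mappers sets the number of Map tasks and --split-by names the column used to slice the table among them:

# Assumed example: split the emp import across 4 Map tasks by the id column
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp \
--split-by id \
--target-dir /sqoop_parallel_demo \
--num-mappers 4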

        It can be said that Sqoop is a data bridge between relational databases and Hadoop. The key component of this bridge is the Sqoop connector, which establishes the connection to the various relational databases and thereby enables data import and export. Sqoop connectors support most commonly used relational databases, such as MySQL, Oracle, DB2, and SQL Server, and there is also a generic JDBC connector for any database that supports the JDBC protocol.
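
        For a database without a dedicated connector, the generic JDBC connector can be used by naming the driver class explicitly with the --driver option. The sketch below is an assumption-laden example (the SQL Server host dbhost, database testdb, and credentials are placeholders, and the corresponding JDBC driver JAR must already be in Sqoop's lib directory):

# Assumed example: list tables through the generic JDBC connector
sqoop list-tables \
--connect "jdbc:sqlserver://dbhost:1433;databaseName=testdb" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--username sa \
--password 123456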

(1) Import principle

        Before importing data, Sqoop uses JDBC to examine the table to be imported: it retrieves all of the table's columns and their SQL data types, and maps those SQL types to Java data types, which the converted MapReduce application uses to hold the field values. Sqoop's code generator uses this information to create a class corresponding to the table, which is used to hold the records extracted from the table.
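
        The generated record class can also be produced on its own with Sqoop's codegen tool, which runs only this class-generation step without importing any data. A minimal sketch, reusing the userdb/emp table created later in this article:

sqoop codegen \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp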

(2) Export principle

        Before exporting data, Sqoop chooses an export method according to the database connection string; for most systems it chooses JDBC. Sqoop then generates a Java class based on the definition of the target table. This generated class can parse a record from the text data and insert values of the appropriate types into the table. A MapReduce job is then launched that reads the source data files from HDFS, parses each record using the generated class, and executes the chosen export method.
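
        Because the generated class must parse each text record before it is inserted, the field delimiter of the HDFS files matters during export. As a hedged sketch (table and directory names reuse the export example later in this article), the delimiter can be stated explicitly with --input-fields-terminated-by:

sqoop export \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp_export \
--export-dir /sqoopresult \
--input-fields-terminated-by ',' \
--num-mappers 1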

2. Sqoop installation and configuration

1. Download and install

Download sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the /export/software/ directory.

        Official website address

http://sqoop.apache.org/

          sqoop-1.4.7 download address

http://archive.apache.org/dist/sqoop/1.4.7/

Enter the directory /export/software/, and execute the following command to decompress sqoop-1.4.7 to /export/servers/.

tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /export/servers/

Rename the installation directory sqoop-1.4.7.bin__hadoop-2.6.0 to sqoop-1.4.7 

Enter the directory /export/servers/ and execute the following commands:
cd /export/servers/
mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop-1.4.7

Copy the relevant Hive JAR packages:

cp $HIVE_HOME/lib/hive-common-1.2.2.jar /export/servers/sqoop-1.4.7/lib
cp $HIVE_HOME/lib/hive-shims*.jar /export/servers/sqoop-1.4.7/lib

2. MySQL configuration and startup

MariaDB is a binary drop-in replacement for the corresponding MySQL version and serves as an enhanced substitute for MySQL. MariaDB is already installed on the system.

Enable the service to start on boot:
systemctl enable mariadb  

Start the database service:
systemctl start mariadb   

Set the root password and grant privileges.
If the root password has not been set yet, execute the following commands to set the password of the database superuser root to 123456:
mysql
mysql>use mysql;
mysql>update user set Password=PASSWORD('123456') where user='root';
mysql>grant all PRIVILEGES on *.* to 'root'@'%' identified by '123456' with grant option;
mysql>FLUSH PRIVILEGES;
mysql>quit

After changing the password, log in to the database as follows:
mysql -uroot -p123456

3. Configure the Sqoop environment

Rename the sqoop-env-template.sh configuration file to sqoop-env.sh:

cd /export/servers/sqoop-1.4.7/conf 
mv sqoop-env-template.sh sqoop-env.sh

Modify the sqoop-env.sh configuration file and add Hadoop environment variables. The details are as follows:

export HADOOP_COMMON_HOME=/export/servers/hadoop-2.10.1
export HADOOP_MAPRED_HOME=/export/servers/hadoop-2.10.1
export HIVE_HOME=/export/servers/hive
export ZOOKEEPER_HOME=/export/servers/zookeeper
export ZOOCFGDIR=/export/servers/zookeeper/conf

Modify /etc/profile to add the Sqoop execution path:

vi /etc/profile

Add the following two lines:
export SQOOP_HOME=/export/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin

Execute the following command to reload the configuration file:
source /etc/profile 

Copy the JDBC driver package of the mysql database to /export/servers/sqoop-1.4.7/lib

The driver JAR was already placed in Hive's lib directory when the Hive data warehouse was configured earlier, so it can simply be copied from there:
cp -r /export/servers/hive/lib/mysql-connector-java-8.0.20.jar /export/servers/sqoop-1.4.7/lib/

Verify that the installation was successful 

sqoop version 

        Note that some software is deliberately left unconfigured here because it is not used; this does not affect the subsequent operations.

4. Sqoop functionality test

Execute the following command to list all database names in MySQL:
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password 123456

Execute the following command to list all table names in the database mysql:
sqoop list-tables --connect jdbc:mysql://localhost:3306/mysql --username root --password 123456
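
Besides list-databases and list-tables, the sqoop eval tool offers another quick connectivity check: it runs a single SQL statement against the database and prints the result. A minimal sketch, reusing the credentials above (the query itself is only an illustrative assumption):
sqoop eval \
--connect jdbc:mysql://localhost:3306/mysql \
--username root \
--password 123456 \
--query "select user, host from user limit 5"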


3. Sqoop data import

        Create the userdb database in MySQL, create three tables (emp, emp_add, and emp_conn), and load the initial data.

Create a data directory under the Sqoop directory to hold the script files:
mkdir -p /export/servers/sqoop-1.4.7/data

Enter the data directory and create a file named userdb.sql.
Fill it with the following MySQL code:

CREATE DATABASE IF NOT EXISTS userdb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

use userdb;

DROP TABLE IF EXISTS `emp`; 
CREATE TABLE emp ( 
	id int(11) NOT NULL, 
	name varchar(100) DEFAULT NULL, 
	deg varchar(100) DEFAULT NULL,
	salary int(11) DEFAULT NULL, 
	dept varchar(10) DEFAULT NULL,
	PRIMARY KEY (id)
);

insert into emp values ('1201', 'gopal', 'manager', 50000, 'TP');
insert into emp values ('1202', 'manisha', 'Proof reader', 50000, 'TP');
insert into emp values ('1203', 'khalil', 'php dev', 50000, 'AC');
insert into emp values ('1204', 'prasanth', 'php dev', 50000, 'AC');
insert into emp values ('1205', 'kranthi', 'admin', 50000, 'TP');

DROP TABLE IF EXISTS `emp_add`; 
CREATE TABLE emp_add ( 
	id int(11) NOT NULL, 
	hno varchar(100) DEFAULT NULL, 
	street varchar(100) DEFAULT NULL,
	city varchar(100) DEFAULT NULL, 
	PRIMARY KEY (id)
);

insert into emp_add values ('1201', '288A', 'vgiri', 'jublee');
insert into emp_add values ('1202', '1081', 'aoc', 'sec-bad');
insert into emp_add values ('1203', '144Z', 'pgutta', 'hyd');
insert into emp_add values ('1204', '78B', 'old city', 'sec-bad');
insert into emp_add values ('1205', '720X', 'hitec', 'sec-bad');

DROP TABLE IF EXISTS `emp_conn`; 
CREATE TABLE emp_conn ( 
	id int(11) NOT NULL, 
	phno varchar(100) DEFAULT NULL, 
	email varchar(100) DEFAULT NULL,
	PRIMARY KEY (id)
);

insert into emp_conn values ('1201', '2356742', '[email protected]');
insert into emp_conn values ('1202', '1661663', '[email protected]');
insert into emp_conn values ('1203', '8887776', '[email protected]');
insert into emp_conn values ('1204', '9988774', '[email protected]');
insert into emp_conn values ('1205', '1231231', '[email protected]');

show tables;

Then start the MySQL client and execute the following:
MariaDB [(none)]>  source userdb.sql

After execution finishes, the table names in userdb are displayed:
+------------------+
| Tables_in_userdb |
+------------------+
| EMP              |
| emp              |
| emp_add          |
| emp_conn         |
+------------------+
4 rows in set (0.00 sec)

Note: userdb.sql should be in the current directory.
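
Alternatively, the script can be run non-interactively from the shell instead of from the MariaDB prompt; a minimal sketch, assuming userdb.sql is in the current directory:
mysql -uroot -p123456 < userdb.sql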

1. Import MySQL table data into HDFS

Execute the following command to import the data of the emp table into HDFS:
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password=123456 \
--table emp \
--target-dir /sqoopresult \
--num-mappers 1

View the import result:
hdfs dfs -cat /sqoopresult/part-m-00000
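
By default, Sqoop separates the imported fields with commas. If a different delimiter is preferred, it can be set with --fields-terminated-by; a hedged sketch using a tab delimiter and an assumed target directory /sqoopresult_tab:
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp \
--target-dir /sqoopresult_tab \
--fields-terminated-by '\t' \
--num-mappers 1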

2. Incrementally import MySQL table data into HDFS

First, insert a new record into the emp table in MySQL:

insert into emp values ('1206', 'itcast', 'java dev', 50000, 'AC');

Then execute the following command to incrementally import the rows whose id is greater than 1205 into HDFS:
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password=123456 \
--table emp \
--target-dir /sqoopresult \
--num-mappers 1 \
--incremental append \
--check-column id \
--last-value 1205

View the import result:
hdfs dfs -cat /sqoopresult/part-m-00001
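
For repeated incremental imports, Sqoop's job tool can save the command as a named job so that the last imported value of the check column is recorded in Sqoop's metastore and updated automatically after each run. A minimal sketch; the job name emp_incr_job is an assumption chosen for illustration:
# Assumed example: save the incremental import as a reusable job, then run it
sqoop job --create emp_incr_job -- import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp \
--target-dir /sqoopresult \
--incremental append \
--check-column id \
--last-value 1206 \
--num-mappers 1

sqoop job --exec emp_incr_job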

3. Import MySQL table data into Hive

Execute the following command to import the emp_add table into the Hive table itcast.emp_add_sp (the itcast database should already exist in Hive):

sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp_add \
--hive-table itcast.emp_add_sp \
--create-hive-table  \
--hive-import \
--num-mappers 1
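
The result can be checked from the Hive side; a minimal sketch, assuming the hive command is on the PATH and the itcast database exists:
hive -e "select * from itcast.emp_add_sp;"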

4. Import a subset of MySQL table data

1) Use the --where parameter to filter data
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--table emp_add \
--where "city='sec-bad'" \
--target-dir /wherequery \
--num-mappers 1
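
The filtered result can be viewed in the same way as the earlier imports (the file name assumes a single map task):
hdfs dfs -cat /wherequery/part-m-00000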

2) Use the --query parameter to filter data
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--query 'select id, name, deg from emp where id > 1203 and $CONDITIONS' \
--target-dir /wherequery2 \
--num-mappers 1
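
When --query is used with more than one map task, Sqoop also requires a --split-by column so that the query result can be divided among the mappers; the literal $CONDITIONS token is replaced with each mapper's range predicate. A hedged sketch with two map tasks and an assumed target directory /wherequery3:
sqoop import \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password 123456 \
--query 'select id, name, deg from emp where id > 1201 and $CONDITIONS' \
--split-by id \
--target-dir /wherequery3 \
--num-mappers 2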

4. Sqoop data export

In the userdb database, use the file emp_export.sql to create a table emp_export

Enter the data directory and create a file named emp_export.sql.
Fill it with the following MySQL code:
use userdb;

DROP TABLE IF EXISTS `emp_export`; 
CREATE TABLE emp_export ( 
	id int(11) NOT NULL, 
	name varchar(100) DEFAULT NULL, 
	deg varchar(100) DEFAULT NULL,
	salary int(11) DEFAULT NULL, 
	dept varchar(10) DEFAULT NULL,
	PRIMARY KEY (id)
);

Start MySQL and execute the following command:
MariaDB [(none)]>  source emp_export.sql

Export the part-m-00000 file under the /sqoopresult directory on HDFS to the emp_export table of the userdb database:

sqoop export \
--connect jdbc:mysql://hadoop01.bgd01:3306/userdb \
--username root \
--password=123456 \
--table emp_export \
--export-dir /sqoopresult \
--num-mappers 1
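
After the export finishes, the data can be verified on the MySQL side; a minimal sketch run from the shell:
mysql -uroot -p123456 -e "select * from userdb.emp_export;"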

Reference books

"Hadoop big data technology principle and application"

 
