Hands-on with the data transfer tool Sqoop (super detailed)

Installation and hands-on use of the data transfer tool Sqoop

JunLeon——go big or go home


Table of contents

Installation and hands-on use of the data transfer tool Sqoop

1. Overview of Sqoop

1. What is Sqoop?

2. Functions and features of Sqoop

2. Sqoop environment construction

1. Environment preparation

2. Installation configuration

3. Sqoop test

3. Sqoop in practice

1. Sqoop commands and parameters

2. Sqoop data import

3. Sqoop data export

4. Troubleshooting


 Foreword:

        With the wide adoption of the Hadoop big data platform, users increasingly need to move data sets between Hadoop and traditional databases. Because traditional ETL tools have relatively poor compatibility with the Hadoop platform, Apache launched Sqoop, which transfers data between Hadoop and relational databases. The Sqoop project started in 2009 as a third-party Hadoop module. Later, to let users deploy it quickly and developers iterate faster, Sqoop became an independent Apache project and was promoted to a top-level project in 2012.

        Sqoop efficiently transfers bulk data between the Hadoop big data platform and structured data stores such as mainframes or relational databases (MySQL, Oracle, Postgres, and so on). Users can use Sqoop to import data from external structured data stores into the Hadoop platform, for example into Hive and HBase. Sqoop can also extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses.

1. Overview of Sqoop

1. What is Sqoop?

        Sqoop is short for "SQL-to-Hadoop". It is a free, open-source data transfer tool used mainly for efficient bulk data transfer between Hadoop and traditional relational databases. Data in a relational database can be imported into Hadoop's HDFS file system, the HBase database, and the Hive data warehouse, and Hadoop data can also be exported into a relational database.

2. Functions and features of Sqoop

(1) Functions of Sqoop

        Sqoop is a simple and efficient data transfer tool for Hadoop. Its two main functions are import and export.

import: RDBMS --> Hadoop

  • Imports data from a relational database or mainframe into the Hadoop platform. The input of the import process is a database table or a mainframe data set. For a database, Sqoop reads the table into Hadoop row by row; for mainframe data sets, Sqoop reads the records of each data set into HDFS. The output of the import process is a set of files containing copies of the imported table or data sets. Because the import runs in parallel, the output is spread across multiple files.

export: Hadoop --> RDBMS

  • Exports data from the Hadoop platform to a relational database or mainframe. The export process reads the required data from HDFS in parallel and parses it into records; when exporting to a database, the records are inserted as new rows into the target table, and when exporting to a mainframe, they form a data set for use by external applications or users.

        When using Sqoop, users only need to run simple commands, and Sqoop automates most of the data transfer process. Sqoop performs imports and exports with MapReduce, which provides parallelism and fault tolerance. Users can customize most aspects of the import and export process: they can restrict the rows or columns that are imported, and they can specify the field delimiter, escape character, and file format used for the file-based representation of the data.
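
As a quick orientation, here is a minimal sketch of the two directions. The database, table, and password values mirror the hands-on examples later in this article; replace them with your own.

# Import: copy a MySQL table into HDFS (one mapper is enough for a small table)
sqoop import --connect jdbc:mysql://localhost:3306/studentdb \
    --username root --password 'root123456' --table student -m 1

# Export: write a tab-separated HDFS directory back into a MySQL table
sqoop export --connect jdbc:mysql://localhost:3306/testdb \
    --username root --password 'root123456' --table student \
    --export-dir /user/data --input-fields-terminated-by '\t'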

(2) Features of Sqoop

        Sqoop is dedicated to data exchange between the Hadoop platform and external data sources and is a strong complement to the Hadoop ecosystem. Sqoop has the following features.

  • Parallel processing

    Sqoop makes full use of MapReduce's parallelism to speed up batch data transfer, and it also relies on MapReduce for fault tolerance.

  • High applicability

    Sqoop talks to relational databases through the JDBC interface, so in theory any database with a JDBC driver can exchange data with Hadoop through Sqoop.

  • Simple to use

    Users drive Sqoop from the command line with only 15 commands in total, 13 of which are used for data processing. The operations are simple, and users can easily move data between Hadoop and an RDBMS.
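
For example, switching to a different JDBC-capable database mostly means changing the connection URL and, if needed, the driver class. The sketch below is hypothetical: the Oracle host, SID, and account are placeholders, and the corresponding JDBC driver JAR would have to be placed in $SQOOP_HOME/lib first.

sqoop list-tables \
    --connect jdbc:oracle:thin:@dbhost:1521:orcl \
    --driver oracle.jdbc.OracleDriver \
    --username scott -P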

2. Sqoop environment construction

1. Environment preparation

(1) sqoop download

Official download: Index of /dist/sqoop (https://archive.apache.org/dist/sqoop/)

Note: Sqoop has two release lines, 1.x and 2.x (known as Sqoop1 and Sqoop2), which are completely incompatible. The latest stable Sqoop1 release is 1.4.7, so downloading sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz is sufficient.
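
If the node has internet access, one way to fetch the package directly (assuming the Apache archive keeps its current layout) is:

cd /opt
wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz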

(2) Upload the package to the master node of the Hadoop cluster

Use XShell or another remote connection tool to upload the downloaded Sqoop archive to the /opt directory of the Hadoop master node's virtual machine for the subsequent installation.

2. Installation configuration

(1) Decompress the uploaded compressed package

cd /opt
# If the archive is not already in /opt, add [-C /opt] to extract it into /opt
[root@BigData01 opt]# tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz    # extract the archive
[root@BigData01 opt]# mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop-1.4.7      # rename the Sqoop installation directory

(2) Configure Sqoop

1. Configure environment variables

vi /etc/profile

export SQOOP_HOME=/opt/sqoop-1.4.7
export PATH=$SQOOP_HOME/bin:$PATH

Make the environment variable configuration take effect:

source /etc/profile
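
A quick sanity check that the variables are visible in the current shell (assuming the paths above):

echo $SQOOP_HOME    # should print /opt/sqoop-1.4.7
which sqoop         # should resolve to /opt/sqoop-1.4.7/bin/sqoop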

2. Configure the sqoop-env.sh file

Copy the template file:

[root@BigData01 conf]# cd $SQOOP_HOME/conf
[root@BigData01 conf]# cp sqoop-env-template.sh sqoop-env.sh

Configuration: vi sqoop-env.sh

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/hadoop-2.7.3
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/hadoop-2.7.3
#set the path to where bin/hbase is available
export HBASE_HOME=/opt/hbase-1.4.13
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/hive-2.3.3
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/zookeeper-3.4.12

(3) Copy the MySQL driver package to the $SQOOP_HOME/lib directory

1. Upload the MySQL driver package

Upload the MySQL driver package mysql-connector-java-5.1.49.jar to the /opt directory

2. Copy to $SQOOP_HOME/lib directory

cd /opt
cp mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/
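
An optional check that the driver JAR is now in place:

ls $SQOOP_HOME/lib/ | grep mysql-connector    # should list mysql-connector-java-5.1.49.jar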

3. Sqoop test

(1) Start the MySQL service

systemctl start mysqld      # start the MySQL service (CentOS 7)
service mysqld start        # start the MySQL service (CentOS 6)
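
If you are not sure whether the service is already running, a status check on CentOS 7 looks like this:

systemctl status mysqld     # the service should be reported as active (running)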

(2) Test - connect to MySQL

[root@BigData01 ~]# sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P  
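
Here -P prompts for the password interactively. For non-interactive use, Sqoop also accepts --password on the command line (as in the later examples) or --password-file pointing to a file that contains only the password; the file path below is just an illustrative assumption.

sqoop list-databases --connect jdbc:mysql://localhost:3306/ \
    --username root --password-file file:///root/.mysql.pwd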

3. Sqoop in practice

1. Sqoop commands and parameters

(1) Sqoop commands

No. | Command | Class | Description
1 | import | ImportTool | Import data into the cluster
2 | export | ExportTool | Export data out of the cluster
3 | codegen | CodeGenTool | Generate Java code for a database table and package it into a JAR
4 | create-hive-table | CreateHiveTableTool | Create a Hive table
5 | eval | EvalSqlTool | Run a SQL statement and view its result
6 | import-all-tables | ImportAllTablesTool | Import all tables of a database into HDFS
7 | job | JobTool | Generate a Sqoop job; the job is not run when created and only executes when invoked explicitly
8 | list-databases | ListDatabasesTool | List all database names
9 | list-tables | ListTablesTool | List all tables in a database
10 | merge | MergeTool | Merge data under different HDFS directories and store the result in a specified directory
11 | metastore | MetastoreTool | Record the metadata of Sqoop jobs; if no metastore instance is started, the default metadata directory is ~/.sqoop, which can be changed in sqoop-site.xml
12 | help | HelpTool | Print Sqoop help information
13 | version | VersionTool | Print Sqoop version information
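
The parameters accepted by each command can be listed from the command line itself, for example:

sqoop help              # list all available commands
sqoop help import       # show the parameters of the import command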

(2) Common parameters: database connection

No. | Parameter | Description
1 | --connect | JDBC URL of the relational database to connect to
2 | --connection-manager | Connection manager class to use
3 | --driver | JDBC driver class to use (specified manually)
4 | --help | Print help information
5 | --password | Password for connecting to the database
6 | --username | Username for connecting to the database
7 | --verbose | Print verbose information to the console

(3) Common parameters: import

No. | Parameter | Description
1 | --enclosed-by <char> | Wrap field values with the specified character
2 | --escaped-by <char> | Escape double quotes inside field values
3 | --fields-terminated-by <char> | Character that terminates each field; the default is a comma
4 | --lines-terminated-by <char> | Character that separates records; the default is \n
5 | --mysql-delimiters | Use MySQL's default delimiters: fields separated by commas, lines by \n, escape character \, field values wrapped in single quotes
6 | --optionally-enclosed-by <char> | Wrap field values that contain double or single quotes with the specified character

(4) Common parameters: export

No. | Parameter | Description
1 | --input-enclosed-by <char> | Character that encloses the input field values
2 | --input-escaped-by <char> | Escape character used in the input fields
3 | --input-fields-terminated-by <char> | Field separator in the input data
4 | --input-lines-terminated-by <char> | Line separator in the input data
5 | --input-optionally-enclosed-by <char> | Character that encloses input field values containing double or single quotes

(5) Common parameters: hive

No. | Parameter | Description
1 | --hive-delims-replacement <arg> | Replace characters such as \r, \n, \013, and \010 in the data with a user-defined string
2 | --hive-drop-import-delims | Remove characters such as \r, \n, \013, and \010 from the data when importing into Hive
3 | --map-column-hive <arg> | Override the data type of generated columns when creating the Hive table
4 | --hive-partition-key | Name of the partition to create; the partition column type defaults to string
5 | --hive-partition-value <v> | Partition value to use when importing data
6 | --hive-home <dir> | Hive installation directory; overrides the default configured directory
7 | --hive-import | Import data from a relational database into a Hive table
8 | --hive-overwrite | Overwrite existing data in the Hive table
9 | --create-hive-table | Defaults to false; if the target table already exists, the creation job fails
10 | --hive-table | The Hive table to create; defaults to the MySQL table name
11 | --table | Name of the relational database table
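
Putting a few of these parameters together, a minimal import into Hive could look like the sketch below. It reuses the company.staff example from the next subsection; the target table name hive_staff is illustrative.

sqoop import --connect jdbc:mysql://localhost:3306/company \
    --username root --password 'root123456' \
    --table staff --hive-import --create-hive-table \
    --hive-table hive_staff -m 1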

(6) Sqoop-Hive commands and parameters

Command & Parameters: create-hive-table

Generates a Hive table whose structure corresponds to the structure of a relational database table.

Command:

Example:

sqoop create-hive-table --connect jdbc:mysql://localhost:3306/company \
    --username root --password 'root123456' \
    --table staff --hive-table hive_staff

Parameters:

No. | Parameter | Description
1 | --hive-home <dir> | Hive installation directory; overrides the default Hive directory
2 | --hive-overwrite | Overwrite existing data in the Hive table
3 | --create-hive-table | Defaults to false; if the target table already exists, the creation job fails
4 | --hive-table | The Hive table to create
5 | --table | Name of the relational database table

Command & Parameters: eval

Quickly runs a SQL statement against the relational database. It is often used before an import to check that the SQL statement is correct and the data looks as expected; the result is printed to the console.

Command:

Example:

sqoop eval --connect jdbc:mysql://localhost:3306/company \
    --username root --password 'root123456' \
    --query "SELECT * FROM staff"

Parameters:

No. | Parameter | Description
1 | --query or -e | The SQL statement to run

Command & Parameters: import-all-tables

Imports all tables of an RDBMS database into HDFS; each table gets its own HDFS directory.

Command:

Example:

sqoop import-all-tables --connect jdbc:mysql://hadoop102:3306/company \
    --username root --password 'root123456' \
    --warehouse-dir /all_tables 

Parameters:

No. | Parameter | Description
1 | --as-avrodatafile | These parameters have the same meaning as the corresponding import parameters
2 | --as-sequencefile |
3 | --as-textfile |
4 | --direct |
5 | --direct-split-size <n> |
6 | --inline-lob-limit <n> |
7 | -m or --num-mappers <n> |
8 | --warehouse-dir <dir> |
9 | -z or --compress |
10 | --compression-codec |

Command & Parameters: job

Generates a Sqoop job. The job is not executed when it is created; it must be run manually.

Command:

Example:

sqoop job --create myjob -- import-all-tables \ 
    --connect jdbc:mysql://localhost:3306/company \ 
    --username root --password 'root123456'
sqoop job --list
sqoop job --exec myjob

Tip: note that there is a space between the -- and import-all-tables.

Tip: if you need to connect to a metastore, add --meta-connect jdbc:hsqldb:hsql://BigData01:16000/sqoop.

Parameters:

No. | Parameter | Description
1 | --create <job-id> | Create a job
2 | --delete <job-id> | Delete a job
3 | --exec <job-id> | Execute a job
4 | --help | Print job help
5 | --list | List the defined jobs
6 | --meta-connect <jdbc-uri> | Connect to a metastore service
7 | --show <job-id> | Show the details of a job
8 | --verbose | Print verbose information while the command runs

Tip: if you have to type the database password manually every time a job is executed, you can add the following property (in $SQOOP_HOME/conf/sqoop-site.xml) so that passwords can be saved in the metastore:

<property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
    <description>If true, allow saved passwords in the metastore.</description>
</property>

Command & Parameters: list-databases

Command:

Example:

sqoop list-databases --connect jdbc:mysql://localhost:3306/ \
    --username root --password 'root123456'

Parameters: the same as the common parameters.

Command & Parameters: list-tables

Command:

Example:

sqoop list-tables --connect jdbc:mysql://localhost:3306/company \
    --username root --password 'root123456'

Parameters: the same as the common parameters.

2. Sqoop data import

Log in to MySQL and prepare the data:

[root@BigData01 ~]# mysql -u root -p    # enter the password when prompted
mysql> CREATE DATABASE studentdb;       # create the database studentdb
mysql> USE studentdb;       # switch to the database
mysql> CREATE TABLE student(id INT,name VARCHAR(25) NOT NULL,age INT);      # create the table student
mysql> INSERT INTO student(id,name,age) VALUES(1001,'zhangsan',18),(1002,'lisi',19),(1003,'wangwu',20);      # insert rows into the student table
mysql> SELECT * FROM student;       # view the table's data

(1) Import data from MySQL into HDFS

sqoop import --connect jdbc:mysql://localhost:3306/studentdb --username root --password 'root123456' --table student -m 1

(2) View the result

By default the data is stored in HDFS under /user/<user>/<table>/. For example, if the current user is root, the actual path is /user/root/student/, and a part-m-00000 file is generated under that path.

[root@BigData01 ~]# hadoop fs -ls -R /user/root/
drwxr-xr-x   - root supergroup       0 2021-12-10 11:05 /user/root/student
-rw-r--r--   1 root supergroup       0 2021-12-10 11:05 /user/root/student/_SUCCESS
-rw-r--r--   1 root supergroup      45 2021-12-10 11:05 /user/root/student/part-m-00000
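
To inspect the imported records themselves, a quick check:

[root@BigData01 ~]# hadoop fs -cat /user/root/student/part-m-00000    # prints the three comma-separated rows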

(3) Imports with other parameters

Note 1: the -Dorg.apache.sqoop.splitter.allow_text_splitter=true property is needed when the table's primary key is not an auto-increment id but a character-type column, so that Sqoop is allowed to split the work on a text column.

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect \
    jdbc:mysql://localhost:3306/studentdb --username root --password 'password' \
    --table student

Note 2: --num-mappers 1 (or -m 1) runs the import with a single mapper, so the result is consolidated into a single output file.

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect \
    jdbc:mysql://localhost:3306/studentdb --username root --password 'password' \
    --table student --num-mappers 1
sqoop import --connect jdbc:mysql://localhost:3306/studentdb \
    --username root --password 'password' --table student --num-mappers 1

Note 3: the data can be imported into a specified directory with the --warehouse-dir parameter, for example:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect \
    jdbc:mysql://localhost:3306/studentdb --username root --password 'password' \
    --table student --warehouse-dir /user/sqoop

Note 4: fields are separated by commas by default; to separate them with \t instead, add the --fields-terminated-by parameter:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect \
    jdbc:mysql://localhost:3306/studentdb --username root --password 'password' \
    --table student --warehouse-dir /user/sqoop --fields-terminated-by '\t' -m 1
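
With --warehouse-dir /user/sqoop, the table's files land under /user/sqoop/student/, so the tab-separated result can be checked in the same way:

hadoop fs -cat /user/sqoop/student/part-m-00000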

3. Sqoop data export

Prepare a data file locally and upload it to HDFS:

[root@BigData01 ~]# vi sqoopdata.txt    # enter a few rows of tab-separated data
103  wangwu      22
102  lisi        21 
101  zhangsan    20
[root@BigData01 ~]# hadoop fs -mkdir -p /user/data
[root@BigData01 ~]# hadoop fs -put sqoopdata.txt /user/data/
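
A quick check that the file made it into HDFS:

[root@BigData01 ~]# hadoop fs -cat /user/data/sqoopdata.txt    # should print the three rows entered above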

Create the target table in MySQL to receive the data exported from Hadoop:

[root@BigData01 ~]# mysql -u root -p
Enter password:
mysql> create database testdb;
Query OK, 1 row affected (0.00 sec)
mysql> use testdb;
Database changed
mysql> create table student(id int,name varchar(25) not null,age int);
Query OK, 0 rows affected (0.03 sec)
mysql> quit;
Bye

Run the export command to write the data into the specified MySQL table:

[root@BigData01 ~]# sqoop export --connect jdbc:mysql://localhost:3306/testdb?characterEncoding=UTF-8 \
 --username root --password 'root123456' --table student \
 --export-dir '/user/data/*' --fields-terminated-by '\t'

Verify that the MySQL table now contains the data:

[root@BigData01 ~]# mysql -u root -p
Enter password:
mysql> use testdb;
Database changed
mysql> SELECT * FROM student;
+------+----------+------+
| id   | name     | age  |
+------+----------+------+
|  103 | wangwu   |   22 |
|  102 | lisi     |   21 |
|  101 | zhangsan |   20 |
+------+----------+------+
3 rows in set (0.00 sec)

4. Troubleshooting

(1) If an error similar to the following appears during export:

ERROR mapreduce.ExportJobBase: Export job failed!
ERROR tool.ExportTool: Error during export:
    Export job failed!
    at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:445)
    at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:80)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)

Solutions:

a. Grant privileges in MySQL.

Remote access may have been granted to other nodes, but not to the local host itself.

grant all privileges on *.* to 'root'@'%' identified by 'password' with grant option; 
flush privileges;

Grant remote access for BigData01 separately:

grant all privileges on *.* to 'root'@'BigData01' identified by 'password' with grant option;   
flush privileges;
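
To confirm which hosts the root account may connect from, a simple check inside the mysql client:

SELECT host, user FROM mysql.user;    -- '%' or 'BigData01' should appear for root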

b. Do not use localhost in the JDBC connection string; use the host name or IP address instead.

c. Make sure the fields of the HDFS data file match the columns of the MySQL table, paying attention to field types and string lengths.

d. Check the input data field separator and the output data field separator (--input-fields-terminated-by vs. --fields-terminated-by).

e. If the data contains Chinese characters, consider setting the MySQL character encoding in the connection parameters so that the JDBC import does not fail because of them:

sqoop export --connect "jdbc:mysql://BigData01:3306/testDB?useUnicode=true&characterEncoding=utf8mb4" \
    --username root -password 'root123456' --table Users --input-fields-terminated-by "," \
    --export-dir '/user/root/Users/part-m-00000' --fields-terminated-by ','

Next: Hands-on with the log collection tool Flume (super detailed)

If you find this helpful, please like and bookmark it, and keep learning with Brother Jun...

Reprinted from: blog.csdn.net/JunLeon/article/details/122160032