SQOOP, the data synchronization tool in the Hadoop ecosystem



1. The concept of sqoop

Most common applications, such as Taobao and Pinduoduo, generate large amounts of data.
E-commerce websites: order data, product data, category data, user information data, user behavior data, etc.
Course websites: order data, video data, course data, user information data, etc.
Although the data formats and meanings differ between fields, they all have one thing in common: most of the data is stored in an RDBMS (relational database). If we want to perform big data statistical analysis on data from a certain field, we must first synchronize that data from the non-big-data environment into the big data environment.
Most data is stored in an RDBMS, while most big data environments are HDFS, Hive, and HBase, so we need to synchronize the RDBMS data into the big data environment.

Sqoop is a top-level Apache open source project (sqoop.apache.org), a tool designed to transfer data between RDBMS and Hadoop (Hive, HDFS, HBase).
Because sqoop's applicable scenarios are so limited, the project receives very few updates and has essentially been retired from Apache.

2. Core functions of sqoop

1. Data import

Refers to importing data from an RDBMS (MySQL, Oracle, SQL Server) into the Hadoop environment (HDFS, HBase, Hive).

The purpose of import is to bring data from a non-big-data environment into the big data environment so it can be statistically analyzed with big data technology.

2. Data export (export)

Refers to exporting data from the Hadoop environment (HDFS, Hive, HBase) to an RDBMS.

After the statistical analysis is complete, we obtain result indicators, which are stored in the big data environment. Visualizing data directly from the big data environment is difficult, so to display the indicators visually (and perform other follow-up operations) we need to export the data from the big data environment to the non-big-data environment, i.e. an RDBMS.

3. The underlying implementation of sqoop

Sqoop is part of the Hadoop ecosystem: to import or export data with Sqoop you write specific import/export commands, and under the hood each command is translated into a MapReduce program that does the actual work.

SQOOP runs on MapReduce and YARN

4. Installation and deployment of sqoop

Because the bottom layer of sqoop is based on Hadoop, sqoop only needs to be installed on a single node. Sqoop also provides a command line client for importing and exporting data.

Sqoop software installation steps:

  • 1. Upload and decompress
  • 2. Configure environment variables - vim /etc/profile
  • 3. Modify the software configuration file
    • sqoop-env.sh file (a configuration sketch follows this list)
  • 4. Special configuration of sqoop
    • Sqoop can migrate data between big data and non-big data environments. Non-big data environments are mainly RDBMS relational databases.
    • The bottom layer of sqoop connecting to RDBMS is also based on JDBC. Therefore, if we want to use sqoop to connect to rdbms, we need to put the jdbc driver jar package corresponding to the database in the lib directory of sqoop.
    • You need to put mysql-connector-java.jar in the lib directory of sqoop
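
As a reference, here is a minimal configuration sketch for steps 2 and 3 above; every install path below is an assumption and must be adjusted to the actual locations on your node.

# /etc/profile (assumed install path; run `source /etc/profile` afterwards)
export SQOOP_HOME=/opt/software/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin

# conf/sqoop-env.sh (assumed Hadoop/Hive install paths)
export HADOOP_COMMON_HOME=/opt/software/hadoop
export HADOOP_MAPRED_HOME=/opt/software/hadoop
export HIVE_HOME=/opt/software/hive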

5. Basic operations of sqoop

Connection parameters for the database are required:
--connect   JDBC URL
--username  database username
--password  database password

1. Use sqoop to list the databases in the RDBMS

sqoop list-databases
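
A hedged example, reusing the host and credentials from the import commands later in this article (the connection URL points at the MySQL server itself, without a database name):

sqoop list-databases --connect 'jdbc:mysql://single:3306/?serverTimezone=UTC' --username root --password Root123456..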


2. Use sqoop to list the tables in a given database

sqoop list-tables
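
A hedged example listing the tables of the demo database used throughout this article:

sqoop list-tables --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC' --username root --password Root123456..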


3. Execute SQL statements through sqoop

sqoop eval --query | -e "sql"
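
A hedged example, reusing the demo database and the student table from the commands later in this article (any SQL statement that MySQL accepts can be passed here):

sqoop eval --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC' --username root --password Root123456.. --query 'select * from student'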


6. Core functional operations of sqoop

1. Data import

  • Refers to importing data from RDBMS relational database into Hadoop environment (HDFS, Hive, HBase)

  • Importing RDBMS data into HDFS (not commonly used)

    • RDBMS connection parameters for the HDFS import:
      --driver
      --connect
      --username
      --password
      [--table]    which table to import
      [--columns]  import only the specified columns of the table
      [--query]    import the result set of a query statement
      [--where]    filter condition; import only the rows that match it
      
      HDFS-side import parameters:
      --target-dir          target path on HDFS
      --delete-target-dir   delete the target path first if it already exists on HDFS
      [--as-textfile|sequencefile..]   file format of the files written to HDFS
      --num-mappers         how many Map Tasks the underlying MapReduce job starts for the import
      --fields-terminated-by   column delimiter of the imported files; defaults to a special character, and is also used as the column delimiter of the table when a Hive table is created automatically
      --lines-terminated-by    row delimiter of the imported files, default is the newline character; rarely used, and even if set it does not take effect unless some special parameters are added
      --null-string      the string written to the HDFS file when a string-typed column of the MySQL table is NULL
      --null-non-string  the string written to the HDFS file when a non-string-typed column of the MySQL table is NULL
      
    • Import all data of a table into HDFS:
      sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --target-dir /import --delete-target-dir --fields-terminated-by '=' --num-mappers 1 --as-sequencefile
      
    • Import the specified columns of a table into HDFS:
      sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --columns student_name,student_age --target-dir /import --delete-target-dir  --fields-terminated-by ',' --num-mappers 1 --as-textfile
      
    • Import the specified data into HDFS according to the query statement:

      • --table table_name --where "condition"   can only import data from a single table
        sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --columns student_name,student_age --where "student_age<40"  --target-dir /import --delete-target-dir  --fields-terminated-by ',' --num-mappers 1 --as-textfile
        
      • --query ""   can import data from multiple tables at once via a join query
        sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=UTC&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --query 'select * from student where student_age<40 and $CONDITIONS'  --target-dir /import --delete-target-dir  --fields-terminated-by ',' --num-mappers 1 --as-textfile
        
  • Importing RDBMS data into Hive tables (commonly used)

    • Import parameters

      • RDBMS connection parameters for the import:
        --driver
        --connect
        --username
        --password
        [--table]   which table to import
        [--columns] import only the specified columns of the table
        [--query]   import the result set of a query statement
        [--where]   filter condition; import only the rows that match it
        
      • Hive-side import parameters:
        --hive-import        import the data into a Hive table rather than HDFS or HBase
        --hive-database      which Hive database to import into
        --hive-table         which Hive table to import into
        --create-hive-table  must be added if the Hive table does not exist yet; the table is then created automatically based on the imported data. If the Hive table already exists, this parameter must not be added
        
      • If we import data from RDBMS into Hive, there are two import modes

        • Full import:
          Typically used the first time RDBMS data is imported into Hive

          • Imports all data of the corresponding RDBMS table into Hive
          • The --hive-overwrite parameter writes all data of the RDBMS table (--table; if --query or --columns is used the import is not strictly "full") into the corresponding Hive table, overwriting whatever is already there
          • The first import does not need --hive-overwrite; for any later import that should load all the data again, --hive-overwrite must be added
        • Incremental import
          Used for every import after the first; only the new RDBMS data is brought into Hive

          • Imports only the newly added data of the RDBMS table into Hive

          • Incremental import works in two ways: one based on an auto-increment id, the other based on a timestamp column.

          • Sqoop supports two incremental modes: append and lastmodified. Hive incremental import only supports append mode; lastmodified mode is supported for HDFS incremental import.

          • --check-column   the corresponding column of the RDBMS table
            --incremental  append
            --last-value   the largest value of that column from the previous import
            
          • Import based on the auto-increment id of the RDBMS table:
            --check-column <auto-increment column of the RDBMS table> --incremental append --last-value <num>. Append-mode incremental import requires a column whose value auto-increments (or at least grows monotonically), together with the largest value of that column at the time of the previous import.

          • Import based on a time column of the RDBMS table:
            --check-column <time column of the RDBMS table> --incremental lastmodified --last-value "timestamp of the last row imported previously"

    • Full import:
      For a full import the Hive table does not need to exist in advance; you can use --create-hive-table to create it automatically.

      • sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=CST&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --hive-import --hive-database test --hive-table student --create-hive-table
        
    • Incremental import:
      For an incremental import the Hive table must exist in advance and must already contain the historical data of the corresponding RDBMS table.

      • Incremental import by auto-increment id

        • sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=CST&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --hive-import --hive-database test --hive-table student --check-column student_id  --incremental append --last-value 5
          
      • Incremental import by creation time

        • --hive-import does not currently support lastmodified (timestamp-based) incremental import; that mode only works when importing to HDFS (a sketch follows at the end of this import section).
    • Note: when importing data into Hive, sqoop performs two steps: 1. it first uploads the data to HDFS through an MR program; 2. it then uses Hive's load command to load the file uploaded to HDFS into the table. If the Hive table does not exist, you can specify --create-hive-table to create it during the import; the column delimiter of the created table stays consistent with the delimiter of the files on HDFS set by --fields-terminated-by.
      When sqoop operates on Hive it needs Hive dependencies, which sqoop does not ship with by default, so migrating data to Hive will fail with an error. To avoid the error, copy hive-common.jar into sqoop's lib directory.

    • [Time-field issues when importing data]

      • After the import, the times in Hive differ from the times in the RDBMS. This is mainly a time zone issue: the time zone used by the RDBMS and the serverTimezone specified in the import parameters are not the same.
      • You only need to make sure the time zone of the RDBMS matches the serverTimezone set in the import parameters.
      • Check the RDBMS time zone: select @@global.time_zone
      • By default, as long as we are in China and have not changed the time zone of the database or the system, both default to +08:00, so use serverTimezone=Asia/Shanghai
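    • [Sketch] lastmodified incremental import into HDFS (referenced above). A hedged example only: the create_time column comes from the student table used in this article, while the --last-value timestamp and the target directory are assumptions. lastmodified mode also requires --append (or --merge-key) so sqoop knows how to combine the new rows with the files already in the target directory.

      sqoop import --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=Asia/Shanghai&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --target-dir /import --fields-terminated-by ',' -m 1 --check-column create_time --incremental lastmodified --last-value '2023-08-23 00:00:00' --append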

2. Data export (export)

Export means moving data from the Hadoop platform into an RDBMS. Exporting is simpler than importing: since the data stored in Hive and HBase ultimately lives on HDFS, you only need to learn how to export data on HDFS to an RDBMS.

[Notes on export]: The data table in RDBMS must exist in advance. If it does not exist, an error will be reported.

Export parameters

RDBMS-related parameters for export
--driver:   JDBC driver class
--connect:  JDBC URL
--username: database username
--password: database password
--table:    which RDBMS table the data is exported into
--columns <col,col,col...>: the RDBMS column names; their order must match the column order in the file, otherwise the data ends up in the wrong columns
HDFS-related parameters for export
--export-dir : which HDFS directory's files to export
--num-mappers | -m : how many map tasks the export command is translated into
--input-fields-terminated-by : very important; the column delimiter of the files on HDFS
--input-lines-terminated-by : the row delimiter of the files on HDFS (the row delimiter is \n)
--update-mode: takes allowinsert or updateonly, the two export modes
allowinsert  updates already exported data and also appends new data (currently not supported for MySQL)
updateonly   only updates previously exported data; new data is not exported
--update-key: required for --update-mode to take effect; it should preferably be a primary key column of the RDBMS table, otherwise the behaviour of --update-mode becomes unpredictable
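
A hedged sketch of the update parameters (updateonly mode, with student_id assumed to be the table's primary key; the remaining values are reused from the export example further below):

sqoop export --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=CST&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --export-dir /user/hive/warehouse/test.db/student --input-fields-terminated-by '=' -m 1 --update-key student_id --update-mode updateonly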

[Note] If update-mode is not specified, the default is to export in append form (data duplication will occur)

If we want to export data to MySQL without duplicating it, we can first use sqoop eval to execute a statement that clears the target table, and then export the data once the table has been emptied.

sqoop eval --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=CST&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --query 'delete from student'

sqoop export --driver com.mysql.cj.jdbc.Driver --connect 'jdbc:mysql://single:3306/demo?serverTimezone=CST&useUnicode=true&characterEncoding=UTF-8' --username root --password Root123456.. --table student --export-dir /user/hive/warehouse/test.db/student --input-fields-terminated-by '=' -m 1 --columns 'student_id,student_name,student_age,create_time'

In short: import is used when we need to process and analyze RDBMS data with big data tools, so we import the RDBMS data into HDFS or Hive; once processing is finished and a result table has been produced, export sends that result table back to the RDBMS for later data visualization.
