Sqoop + MySQL + Hive data collection, with examples

Table of contents

Business scenario

Solution

Specific steps

1. Download and install sqoop

2. sqoop connection test

3. Use sqoop to import mysql data into a hive table

Summary

sqoop import/export command parameters

Example 1: sqoop import

Points to note when importing

Example 2: sqoop export

Points to note when exporting

Problems encountered


Business scenario

Scenario 1. Business data needs to be loaded into hive so that BI statistics can be run on it.

Scenario 2. The final results of statistical analysis on the hive data need to be dumped back to mysql so they can be displayed on the client side.

Solution

We can use the sqoop tool to move data from the business database (mysql or oracle) into hive tables for subsequent big-data statistical analysis, and to push the final results back to mysql afterwards.

Specific steps

Note: installing the mysql or oracle database itself is not covered here; please look that up separately if needed.

Installation of hive and hadoop:

Windows10 install Hadoop3.3.0_xieedeni's blog - CSDN blog

Install Hive3.1.2 on Windows 10_xieedeni's Blog - CSDN Blog

1. Download and install sqoop

1. Download

Download address: Index of /dist/sqoop

Note the version here:

Sqoop comes in two lines, sqoop1 and sqoop2; for the specific differences between the two, please look up the details yourself.

Sqoop1 covers versions up to 1.4.7, while sqoop2 starts at 1.99.1, the latest version being 1.99.7.

Here I downloaded and used version 1.4.7: Index of /dist/sqoop/1.4.7

2. Configure environment variables

Here I install and use sqoop in a Windows environment.

After decompressing the archive, set the environment variable SQOOP_HOME to the decompression directory, and add %SQOOP_HOME%/bin to the PATH.
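For example, in a Windows command prompt (the sqoop path below is only an assumption; use wherever you actually extracted the archive, and make the settings permanent through the system environment variables dialog if needed):

rem Temporary settings for the current cmd session; adjust the path to your own installation
set SQOOP_HOME=D:\work\soft\sqoop-1.4.7.bin__hadoop-2.6.0
set PATH=%PATH%;%SQOOP_HOME%\bin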

3. Modify the sqoop configuration file

Copy sqoop-env-template.sh in %SQOOP_HOME%/conf and rename the copy to sqoop-env.sh.

Then edit %SQOOP_HOME%/conf/sqoop-env.sh:

# Set Hadoop-specific environment variables here.

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=D:\work\soft\hadoop-3.3.0

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=D:\work\soft\hadoop-3.3.0

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=D:\work\soft\apache-hive-3.1.2-bin

export HIVE_CONF_DIR=D:/work/soft/apache-hive-3.1.2-bin/conf
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

4. Add mysql-connector-java-8.0.x.jar

Download mysql-connector-java-8.0.x.jar and copy it into the %SQOOP_HOME%/lib directory:

mysql-connector-java-8.0.21.jar

Download address: https://dev.mysql.com/downloads/file/?id=496589
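For example, from the directory the jar was downloaded to:

rem Copy the JDBC driver into sqoop's lib directory
copy mysql-connector-java-8.0.21.jar "%SQOOP_HOME%\lib\"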

2. sqoop connection test

1. Version test

sqoop version

2. Database connection test

sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/mydb --username root --password 123456

If the list of databases is printed, the connection is working.
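Optionally, you can also check that the tables inside a specific database are reachable:

sqoop list-tables --connect jdbc:mysql://127.0.0.1:3306/mydb --username root --password 123456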

3. Use sqoop to import mysql data into a hive table

1. Full import

sqoop import --connect jdbc:mysql://127.0.0.1:3306/ddbi --username root --password 123456 --table behavior --hive-import --hive-database=dd_database_bigdata --hive-table dwd_base_event_log_his --m 1 --input-null-string '\\N' --input-null-non-string '\\N'

The --input-null-string and --input-null-non-string options here handle columns whose value is NULL in mysql; they are written as \N so that hive treats them as NULL.

The import succeeds.

Run a hive query to verify the data:

 select * from tablename where id = 1; 

Do we need to create the hive table beforehand? No; sqoop creates it automatically during the import if it does not already exist.
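If you prefer to create the table yourself (for example to control the storage format and delimiter, see the import notes further below), a minimal sketch could look like this; the column list here is hypothetical and the tab delimiter is only an example, which must then match --fields-terminated-by on the sqoop import:

hive -e "CREATE TABLE IF NOT EXISTS dd_database_bigdata.dwd_base_event_log_his (id BIGINT, event_json STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE"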

2. Incremental import

sqoop import --connect jdbc:mysql://127.0.0.1:3306/ddbi --username root --password 123456 --table behavior --hive-import --hive-database dd_database_bigdata --hive-table dwd_base_event_log_his --m 1 --incremental append --check-column id --last-value 124870 --input-null-string '\\N' --input-null-non-string '\\N'

The incremental update completes.
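The --last-value above is the highest id already present in the hive table. One way to look it up before running the incremental import:

hive -e "select max(id) from dd_database_bigdata.dwd_base_event_log_his"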

3. Incremental import job

a. Create a job for incremental extraction

sqoop job --create fdc_equipment_job \
    -- import --connect jdbc:oracle:thin:@xx.xx.xx.xx:1521:xx \
    --username xxx --password xxx \
    --table PROD_FDC.EQUIPMENT \
    --target-dir=/user/hive/warehouse/fdc_test.db/equipment \
    --hive-import --hive-database fdc_test --hive-table equipment \
    --incremental append \
    --check-column equipmentid --last-value 1893

Note: for incremental extraction you must specify --incremental append, name the source-table key column to track with --check-column equipmentid, and give the current maximum value of that column in the hive table with --last-value 1893. The point of creating a sqoop job is that after each run sqoop records the new last-value automatically, so the next execution picks it up by itself and you never have to update it by hand.

 b. Execute sqoop job

sqoop job --exec fdc_equipment_job

c. Delete the sqoop job

sqoop job --delete fdc_equipment_job

d. View sqoop job

sqoop job --show fdc_equipment_job

Putting this together for the mysql table used earlier:

sqoop job --create sqoop_job_behavior_his -- import --connect jdbc:mysql://127.0.0.1:3306/ddbi --username root --password 123456 --table behavior --hive-import --hive-database dd_database_bigdata --hive-table dwd_base_event_log_his --incremental append --check-column id --last-value 125357 --m 1 --input-null-string '\\N' --input-null-non-string '\\N'

sqoop job --exec sqoop_job_behavior_his
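To see every job that has been saved (useful to confirm the create step worked):

sqoop job --list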

Summary

sqoop import/export command parameters

Common arguments
Option     Description
--connect     JDBC connection string
--connection-manager     Connection manager class to use
--driver     JDBC driver class to use
--hadoop-mapred-home     Path to $HADOOP_MAPRED_HOME
--help     Print usage help
--password-file     Path to a file containing the authentication password
-P     Read the password from the console
--password     Authentication password
--username     Authentication username
--verbose     Print verbose information while working
--connection-param-file     Optional properties file providing extra connection parameters

import
Option     Description
--append     Append data to an existing dataset in HDFS
--as-avrodatafile     Import data as Avro data files
--as-sequencefile     Import data as SequenceFiles
--as-textfile     Import data as plain text files (default)
--boundary-query     Boundary query used to create input splits (InputSplit)
--columns <col,col,col…>     Import only the specified columns from the table
--delete-target-dir     Delete the target directory first if it already exists
--direct     Use direct import mode (faster)
--direct-split-size     Split the input stream every n bytes (direct mode only)
--fetch-size     Number of records to read from the database in one batch
--inline-lob-limit     Maximum size of an inline LOB
-m,--num-mappers     Import in parallel using n map tasks
-e,--query     Import the result of the given SQL statement
--split-by     Column used to split the data across mappers
--table     Source table to import
--target-dir     Target HDFS path
--warehouse-dir     HDFS root path for table storage
--where     WHERE clause used to select the rows to import
-z,--compress     Enable compression
--compression-codec     Hadoop compression codec to use (default gzip)
--null-string     String to write for NULL values in string columns
--null-non-string <null-string>     String to write for NULL values in non-string columns

--create-hive-table	Create the Hive table automatically if it does not exist; fail if it already exists
--hive-drop-import-delims	Drop \n, \r and \01 characters from the source data when importing into Hive
--hive-delims-replacement	Replace \n, \r and \01 in the source data with a custom string when importing into Hive
--hive-partition-key	Partition column of the Hive table
--hive-partition-value <v>	Value of the partition column for this import
--map-column-hive <map>	Override the Hive type of specific columns, e.g. map the ID column to String: --map-column-hive ID=String

export
Option     Description
--validate     Validate the copied data (single-table copies only); a validator implementation class can be specified
--validation-threshold     Class used as the validation threshold
--direct     Use direct export mode (faster)
--export-dir     HDFS source path for the export
-m,--num-mappers     Export in parallel using n map tasks
--table     Destination table to export into
--call     Stored procedure to call for each record
--update-key     Column(s) used as the update key; separate multiple columns with commas
--update-mode     Update strategy: updateonly (default) or allowinsert
--input-null-string     String in the exported data that should be treated as NULL for string columns
--input-null-non-string     String in the exported data that should be treated as NULL for non-string columns
--staging-table     Staging table in which the data is held before being written to the destination table
--clear-staging-table     Clear any data left in the staging table
--batch     Use batch mode when executing the underlying statements

Example 1: sqoop import

#!/bin/bash
# If a date is passed as the first argument, use it; otherwise default to the day before today
#do_date=$(date -d "-1 day" +%F)

if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

jdbc_url_dduser="jdbc:mysql://xxx:3306/user?serverTimezone=Asia/Shanghai&characterEncoding=utf8&tinyInt1isBit=false"

jdbc_username=root
jdbc_password=123456

echo "===开始从mysql中提取业务数据日期为 $do_date 的数据==="

#sqoop-mysql2hive-appconfig
sqoop import --connect $jdbc_url_dduser --username $jdbc_username --password $jdbc_password --table app_config --hive-overwrite --hive-import --hive-table dd_database_bigdata.ods_app_config --target-dir /warehouse/dd/bigdata/ods/tmp/ods_app_config -m 1 --input-null-string '\\N' --input-null-non-string '\\N'
#sqoop-mysql2hive-content
sqoop import --connect $jdbc_url_ddresource --username $jdbc_username --password $jdbc_password --query "select  n_id,u_id,u_app,app_id,global_id,nm_id,n_type,n_title,n_category,n_source,n_publish_time,n_create_time from news where DATE_FORMAT(n_create_time,'%Y-%m-%d')='$do_date' and 1=1 and \$CONDITIONS " -m 1 --hive-partition-key dt --hive-partition-value $do_date --target-dir /warehouse/dd/bigdata/ods/tmp/ods_content --hive-overwrite --hive-import --hive-table dd_database_bigdata.ods_content --input-null-string '\\N' --input-null-non-string '\\N'


echo "===从mysql中提取日期为 $do_date 的数据完成==="

Points to note when importing

1. The hive table being imported into needs to use the textfile storage format, and remember to specify the field delimiter.

2. Remember to add -m 1; if you want more than one mapper you must specify --split-by instead (see the sketch after this list).

3. When using --query, the WHERE clause of the sql must contain $CONDITIONS, a placeholder that sqoop fills in. If the sql is wrapped in double quotes, it has to be escaped as \$CONDITIONS.

4. When using --query, --target-dir must also be given: sqoop first stages the query result on hdfs, and --target-dir is where that temporary data is written.
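A sketch of a parallel import that uses --split-by instead of -m 1; the mapper count of 4 is just an example, and the --split-by column should be evenly distributed, e.g. a numeric primary key:

sqoop import --connect jdbc:mysql://127.0.0.1:3306/ddbi --username root --password 123456 --table behavior --hive-import --hive-database dd_database_bigdata --hive-table dwd_base_event_log_his -m 4 --split-by id --input-null-string '\\N' --input-null-non-string '\\N'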

Example 2: sqoop export

#!/bin/bash
# If a date is passed as the first argument, use it; otherwise default to the day before today
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

jdbc_url="jdbc:mysql://xxx:3306/ddbi?serverTimezone=Asia/Shanghai&characterEncoding=utf8"
jdbc_username=root
jdbc_password=123456


echo "===开始从hive结果表中提取数据到mysql日期为 $do_date 的数据==="

echo "===先删除mysql表中日期为 $do_date 的数据==="
sqoop eval --connect $jdbc_url --username $jdbc_username --password $jdbc_password --query "delete from ads_article_share_info where DATE_FORMAT(date_id,'%Y-%m-%d') = '$do_date'"
echo "===完成删除mysql表中日期为 $do_date 的数据==="
echo "===进行hive导入mysql表中日期为 $do_date 的数据==="
sqoop export --connect $jdbc_url --username $jdbc_username --password $jdbc_password --table ads_article_share_info --export-dir /warehouse/dd/bigdata/ads/ads_article_share_info/dt=$do_date --columns "date_id,measure_id,measure_value,biz_id,biz_code,create_time,update_time" --fields-terminated-by '\t' --input-null-string '\\N' --input-null-non-string '\\N'
echo "===完成hive导入mysql表中日期为 $do_date 的数据==="

echo "===完成从hive结果表中提取数据到mysql日期为 $do_date 的数据==="

Points to note when exporting

1. When exporting from a hive table to a relational database, the hive table must be stored as textfile, because the export actually reads the underlying files. If the table uses another format, the export fails with a "not a file" style error.

2. For export, remember to specify the delimiter with --fields-terminated-by; it must match the field delimiter of the hive table.

3. When exporting the data of a hive table partition, point --export-dir at the partition directory, for example --export-dir /warehouse/dd/bigdata/ads/ads_article_share_info/dt=2021-11-01

4. When sqoop exports to mysql and existing rows need to be updated, there are two approaches:

        a. Use --update-key to name the column(s) sqoop matches on in mysql; multiple columns are separated by commas. Those columns should ideally form a primary or unique key and must not be null, for example --update-mode allowinsert --update-key stat_date,create_date. You also have to add --update-mode (allowinsert or updateonly) to say whether new rows may be inserted or only existing rows updated. A sketch follows this list.

        b. If the columns you would use as the key can contain nulls, delete the existing rows in mysql first and then run a plain export, which is what the example script above does.
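A sketch of the upsert approach, reusing the table and paths from the example above; the choice of date_id and biz_id as the update key is an assumption, and they must correspond to a unique key in the mysql table:

sqoop export --connect "jdbc:mysql://xxx:3306/ddbi?serverTimezone=Asia/Shanghai&characterEncoding=utf8" --username root --password 123456 --table ads_article_share_info --export-dir /warehouse/dd/bigdata/ads/ads_article_share_info/dt=2021-11-01 --columns "date_id,measure_id,measure_value,biz_id,biz_code,create_time,update_time" --fields-terminated-by '\t' --update-key date_id,biz_id --update-mode allowinsert --input-null-string '\\N' --input-null-non-string '\\N'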

Problems encountered

1. In step 2, the database connection test reports an error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils

2021-09-30 13:55:56,530 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
        at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73)
        at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89)
        at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33)
        at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51)
        at com.cloudera.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:30)
        at org.apache.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:46)
        at com.cloudera.sqoop.manager.CatalogQueryManager.<init>(CatalogQueryManager.java:31)
        at org.apache.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:38)
        at com.cloudera.sqoop.manager.InformationSchemaManager.<init>(InformationSchemaManager.java:31)
        at org.apache.sqoop.manager.MySQLManager.<init>(MySQLManager.java:65)
        at org.apache.sqoop.manager.DefaultManagerFactory.accept(DefaultManagerFactory.java:67)
        at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:184)
        at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:272)
        at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:44)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 20 more

The commons-lang jar is missing. Download it and put it under %SQOOP_HOME%/lib:

http://mirrors.tuna.tsinghua.edu.cn/apache//commons/lang/binaries/commons-lang-2.6-bin.zip

2. Importing from mysql into hive reports an error

2021-10-08 15:40:35,682 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
2021-10-08 15:40:35,687 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

Copying hive-exec-*.jar from %HIVE_HOME%/lib into sqoop's lib directory fixes this problem.
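For the Hive 3.1.2 installation used above, that would be (the exact jar name is an assumption based on the version installed earlier):

copy "%HIVE_HOME%\lib\hive-exec-3.1.2.jar" "%SQOOP_HOME%\lib\"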

3. Importing from mysql into hive reports the error HiveConf of name xxx does not exist

It looks like the resource files under hive's lib directory are not being picked up. After searching through a lot of material, the commonly suggested fix is to add the environment variable

export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/lib/*

but that is written for linux, and I was unsure how to do the equivalent on windows. I tried copying all of the jars under hive's lib into sqoop's lib, but after various attempts it still did not work. In the end I simply restarted the hive metastore (commands below), and somewhat inexplicably the import then succeeded. An environment assembled by hand is never perfectly compatible.

cd %HIVE_HOME%/bin
hive --service metastore &

4. Running the import command under windows fails with java.lang.ClassNotFoundException: Class tablename not found. The command executed:

sqoop import --connect "jdbc:mysql://xxx:3306/ddbi?serverTimezone=Asia/Shanghai" --username root --password 123456 --table behavior --hive-import --hive-database dd_database_bigdata --hive-table dwd_base_event_log_his --m 1 --input-null-string '\\N' --input-null-non-string '\\N'


// behavior is a table in the mysql database
Error message:
java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class behavior not found

Cause of error:
When sqoop import runs, the generated java file is written to the current directory by default, while the compiled .class and .jar files are written under /tmp/sqoop-<username>/compile. Because these are different directories, the generated class cannot be found at runtime; the java, .class and .jar files need to end up in the same directory.
Solution:
To keep the generated files out of the root directory, switch to a temporary directory (D:\tmp here) and generate everything in place with the following commands:

cd D:\tmp

sqoop import --connect "jdbc:mysql://xxx:3306/ddbi?serverTimezone=Asia/Shanghai" --username root --password 123456 --table behavior --hive-import --hive-database dd_database_bigdata --hive-table dwd_base_event_log_his --m 1 --input-null-string '\\N' --input-null-non-string '\\N' --bindir ./

Note the addition of --bindir ./

Official description:

--bindir <dir>: the output directory for the generated java file, the compiled .class file, and the jar that packages them.


After adding --bindir, the import runs successfully.


Origin blog.csdn.net/xieedeni/article/details/120565551