Sqoop export in practice: avoiding pitfalls (Parquet format, detailed columns parameter)

Table of Contents

One, Data export in Parquet + Snappy compression format

1. Create table dwd_report_site_hour

2. Solution

Two, Sqoop export columns parameter description

1. The order of Hive fields is consistent with MySQL 

2. Adjust the order of fields in Sqoop export columns

3. Test Sqoop export with a field removed from columns

Three, Summary


One, Data export in Parquet + Snappy compression format

1. Create table dwd_report_site_hour

Create a partitioned external table stored as Parquet with SNAPPY compression:

create external table if not exists dwd_report_site_hour(
sitetype string,
sitename string,
innum int,
outnum int,
datatime string,
inserttime timestamp,
modifyTime timestamp
)
partitioned by(dt string)
row format delimited
fields terminated by '\001'
stored as parquet TBLPROPERTIES('parquet.compression'='SNAPPY');
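
The export target is a MySQL table with the same name. Its DDL is not shown in the original post; a minimal sketch of what it might look like (column types are assumptions) is:

CREATE TABLE IF NOT EXISTS dwd_report_site_hour (
    sitetype   VARCHAR(64),
    sitename   VARCHAR(128),
    innum      INT,
    outnum     INT,
    datatime   VARCHAR(32),
    inserttime TIMESTAMP NULL,
    modifyTime TIMESTAMP NULL
);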

 

The partitioned external table is stored in Parquet format with SNAPPY compression, and the job that exports the metric data from the Hive data warehouse to MySQL fails.

When Sqoop exports data from a Parquet-backed Hive table to a relational database, it reports a Kite SDK error saying the dataset file cannot be found. This is a known Sqoop issue; see SQOOP-2907:

https://issues.apache.org/jira/browse/SQOOP-2907

2. Solution

# Export through HCatalog so Sqoop can read the Parquet-backed Hive table directly
sqoop-export --connect jdbc:mysql://192.168.2.226:3306/kangll \
--username root \
--password winner@001 \
--table dwd_report_site_hour \
--update-key sitetype,sitename \
--update-mode allowinsert \
--input-fields-terminated-by '\001' \
--hcatalog-database kangll \
--hcatalog-table dwd_report_site_hour \
--hcatalog-partition-keys dt \
--hcatalog-partition-values '20200910' \
--num-mappers 1 \
--input-null-string '\\N' \
--input-null-non-string '\\N'

Parameter Description:

--table                      Name of the target table in the MySQL database
--hcatalog-database          Hive database name
--hcatalog-table             Hive table to export from
--hcatalog-partition-keys    Partition key column(s)
--hcatalog-partition-values  Partition value(s)
--num-mappers                Number of map tasks for the job

Two, Sqoop export columns parameter description

If the --columns parameter is not used, the fields of the Hive table must by default match the MySQL table in both order and count. If the order or the number of fields differs, add the --columns parameter to control the export:

1. The Hive table and the target MySQL table may have different numbers of fields; add the --columns parameter.
2. The field order of the Hive table and the target MySQL table may differ; the --columns parameter maps the fields by name.

1. The order of Hive fields is consistent with MySQL 

1. Add an id field
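
The post does not show where the id field is added; presumably it goes into the MySQL table as an auto-increment surrogate key, so that MySQL has one more field than Hive. A sketch under that assumption:

-- MySQL: add a hypothetical auto-increment primary key
ALTER TABLE dwd_report_site_hour
    ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;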

2. Insert two records in MySQL
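
The exact rows are not shown; hypothetical MySQL inserts with made-up sample values might look like:

-- illustrative values only
INSERT INTO dwd_report_site_hour
    (sitetype, sitename, innum, outnum, datatime, inserttime, modifyTime)
VALUES
    ('mall', 'site_A', 100,  80, '2020091009', NOW(), NOW()),
    ('mall', 'site_B', 200, 150, '2020091009', NOW(), NOW());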

3. Add a record to the Hive table
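
Likewise, a hypothetical Hive insert into the dt='20200910' partition (sample values only):

-- Hive: INSERT ... SELECT so the timestamp columns can use current_timestamp()
INSERT INTO TABLE dwd_report_site_hour PARTITION (dt = '20200910')
SELECT 'mall', 'site_C', 300, 260, '2020091010', current_timestamp(), current_timestamp();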

4. Specify columns

​
# Sqoop export from Hive's Parquet storage to the relational database
sqoop-export --connect jdbc:mysql://192.168.2.226:3306/kangll \
--username root \
--password kangll \
--table dwd_report_site_hour \
--update-key sitetype,sitename \
--update-mode allowinsert \
--input-fields-terminated-by '\001' \
--hcatalog-database kangll \
--hcatalog-table dwd_report_site_hour \
--hcatalog-partition-keys dt \
--hcatalog-partition-values '20200910' \
--num-mappers 1 \
--columns sitetype,sitename,innum,outnum,datatime,inserttime,modifyTime

 

5. Check whether the MySQL table export is successful
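
A simple way to verify is to query the target table in MySQL, for example:

SELECT sitetype, sitename, innum, outnum, datatime, inserttime, modifyTime
FROM dwd_report_site_hour;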

2. Adjust the order of fields in Sqoop export columns

Next, I change the order of the fields in --columns to see whether the export to MySQL still succeeds.

1. Insert a record in the Hive table

2. Swap the order of the sitetype and sitename fields in --columns

# Keep the same field order as the MySQL table
--columns sitetype,sitename,innum,outnum,datatime,inserttime,modifyTime
# Swap the sitetype and sitename fields
--columns sitename,sitetype,innum,outnum,datatime,inserttime,modifyTime

3. View MySQL

3. Test Sqoop export with a field removed from columns

1. Remove the datatime field

# Keep the same field order as the MySQL table
--columns sitetype,sitename,innum,outnum,datatime,inserttime,modifyTime
# Swap the sitetype and sitename fields
--columns sitename,sitetype,innum,outnum,datatime,inserttime,modifyTime
# Remove the datatime field
--columns sitename,sitetype,innum,outnum,inserttime,modifyTime

2. Run the Sqoop export with datatime removed from --columns

sqoop-export --connect jdbc:mysql://192.168.2.226:3306/kangll \
--username root \
--password winner@001 \
--table dwd_report_site_hour \
--update-key sitetype,sitename \
--update-mode allowinsert \
--input-fields-terminated-by '\001' \
--hcatalog-database kangll \
--hcatalog-table dwd_report_site_hour \
--hcatalog-partition-keys dt \
--hcatalog-partition-values '20200910' \
--num-mappers 1 \
--columns sitename,sitetype,innum,outnum,inserttime,modifyTime

3. Insert another record into the Hive table

 

Three, Summary

1. For Hive data stored as Parquet + Snappy, export it with the hcatalog parameters. If you do not use them, an alternative is to query the Hive table into a temporary HDFS directory, fetch the files to the local filesystem, and load them into MySQL with LOAD DATA. That route is clearly more cumbersome, but SQL gives good, flexible control over the table field contents.
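
A rough sketch of that alternative route (paths, the delimiter, and credentials are placeholders, not taken from the original post):

# 1. Dump the query result as delimited text into a temporary HDFS directory
hive -e "
INSERT OVERWRITE DIRECTORY '/tmp/dwd_report_site_hour_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT sitetype, sitename, innum, outnum, datatime, inserttime, modifyTime
FROM kangll.dwd_report_site_hour WHERE dt = '20200910';"

# 2. Fetch the files to the local filesystem
hdfs dfs -getmerge /tmp/dwd_report_site_hour_export /tmp/dwd_report_site_hour.csv

# 3. Load the file into MySQL (the client must allow local_infile)
mysql --local-infile=1 -h 192.168.2.226 -uroot -p kangll -e "
LOAD DATA LOCAL INFILE '/tmp/dwd_report_site_hour.csv' INTO TABLE dwd_report_site_hour
FIELDS TERMINATED BY ','
(sitetype, sitename, innum, outnum, datatime, inserttime, modifyTime);"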

2. The --columns parameter of Sqoop export maps the Hive table's fields to the MySQL table's fields. Any MySQL field omitted from the export defaults to NULL.

 

Original post: blog.csdn.net/qq_35995514/article/details/108542495