Sqoop in production: common problems and optimizations

Table of Contents

0 Preface

1 Common problems in production

(1) Sqoop null value problem

(2) Sqoop data consistency problem

(3) Exporting ADS-layer data stored as ORC or Parquet to MySQL

(4) Data skew problem

(5) Setting map task parallelism greater than 1

2 Summary


0 Preface

Sqoop is an important data synchronization tool in the big data ecosystem. This article summarizes common problems encountered when running Sqoop in production and gives concrete solutions.

1 Common problems in production

(1) Sqoop null value problem

In Hive, NULL is stored as "\N" at the underlying storage layer, while MySQL stores NULL as NULL. This mismatch makes the data inconsistent when it is synchronized between the two sides. When synchronizing, Sqoop requires the data format and data types on both ends to match strictly, otherwise exceptions occur.

Option 1: Rely on Sqoop's own parameters

   1) When exporting data, use the two parameters --input-null-string and --input-null-non-string.

   2) When importing data, use --null-string and --null-non-string.
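
A minimal sketch of Option 1; the connection details, table names and paths are placeholders, not from the original article:

```bash
# Import (MySQL -> HDFS/Hive): store SQL NULLs as \N so Hive reads them back as NULL.
# The '\\N' escaping follows the examples in the Sqoop user guide.
sqoop import \
  --connect jdbc:mysql://mysql_host:3306/some_db \
  --username some_user \
  --password-file file:///home/etl/.pwd \
  --table some_table \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --target-dir /user/hive/warehouse/some_db.db/some_table \
  -m 4

# Export (HDFS/Hive -> MySQL): interpret \N in the files as SQL NULL.
sqoop export \
  --connect jdbc:mysql://mysql_host:3306/some_db \
  --username some_user \
  --password-file file:///home/etl/.pwd \
  --table some_table \
  --export-dir /user/hive/warehouse/some_db.db/some_table \
  --input-null-string '\\N' \
  --input-null-non-string '\\N' \
  -m 4
```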

Option 2: Change the underlying NULL representation of the Hive table to '' (empty string) when creating it

    When exporting from Hive, create a temporary table for the table to be exported, with exactly the same structure, fields and types as the MySQL table it is synchronized with, and insert the data to be exported into it. When creating this temporary table, change Hive's underlying NULL storage from "\N" to '' (empty string) by adding the following clause:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.null.format' = '')

  An example:

```sql
DROP TABLE $output_table;

CREATE TABLE IF NOT EXISTS $output_table (
  gw_id STRING,
  sensor_id STRING,
  alarm_level STRING,
  alarm_state STRING,
  alarm_type STRING,
  alarm_scene STRING,
  dyear STRING,
  dmonth STRING,
  count BIGINT,
  compute_month BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.null.format' = '')
LOCATION '/apps/hive/warehouse/phmdwdb.db/$log_dir';
```
  •   Then insert the data to be exported into this temporary table.
  •   Finally, export the data to MySQL with the sqoop export command.

  The same approach works for imports and is not repeated here.

  •   Option 2 is recommended in production. Although it is more work, it avoids unnecessary trouble inside Sqoop, makes problems easier to locate, and keeps the logic clear: the sqoop export only needs a basic export command, so Sqoop can easily be wrapped into a generic script for scheduling. A minimal sketch of the two steps follows.
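
A minimal sketch of the two steps, reusing the `$output_table` temporary table created above; the source table `phmdwdb.ads_alarm_stat`, the MySQL connection details and the target table `alarm_stat` are placeholder names, not from the original article:

```bash
# Step 1: load the rows to be exported into the temporary table (Hive).
hive -e "INSERT OVERWRITE TABLE $output_table SELECT * FROM phmdwdb.ads_alarm_stat;"

# Step 2: a plain sqoop export of the temporary table's HDFS directory to MySQL.
# The table was created with Hive's default field delimiter, hence '\001'.
sqoop export \
  --connect jdbc:mysql://mysql_host:3306/report_db \
  --username some_user \
  --password-file file:///home/etl/.pwd \
  --table alarm_stat \
  --export-dir /apps/hive/warehouse/phmdwdb.db/$log_dir \
  --input-fields-terminated-by '\001' \
  -m 4
```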

(2) Sqoop data consistency problem

      1) Scenario 1: Sqoop exports to MySQL with 4 map tasks, and 2 of them fail. At that moment MySQL already contains the data written by the 2 successful map tasks, and it happens that the boss looks at the report exactly then. When the development engineer notices the failed job, he debugs it and eventually imports all of the data into MySQL correctly. The boss then looks at the report again and sees figures that differ from before, which is not acceptable in a production environment.

     Official website: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.


The --staging-table approach

  • (Create a temporary table, import into it with Sqoop, and only after that succeeds move the temporary table's data into the MySQL business table inside a transaction. Creating temporary tables solves many Sqoop import/export problems, so learn to use them well.)
  • With the --staging-table option, the data in HDFS is first exported into the staging table; only when that export succeeds is the staged data moved into the target table in a single transaction (in other words, the process either succeeds completely or fails completely).
  • To use the staging option, the staging table must be empty before the job runs, or the --clear-staging-table option must be given. If the staging table contains data and --clear-staging-table is specified, Sqoop deletes all data in the staging table before the export starts.

Note: staging is not supported when --direct is used, and it cannot be combined with the --update-key option.
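
The staging table must already exist in MySQL with the same structure as the target table. A minimal sketch, using the table names from the export command below:

```bash
# Create the staging table as an exact structural copy of the target table (run against MySQL).
mysql -h 192.168.137.10 -u root -p user_behavior \
  -e "CREATE TABLE IF NOT EXISTS app_cource_study_report_tmp LIKE app_cource_study_report;"
```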

An example export using the staging table:

```bash
sqoop export --connect jdbc:mysql://192.168.137.10:3306/user_behavior \
--username root \
--password 123456 \
--table app_cource_study_report \
--columns watch_video_cnt,complete_video_cnt,dt \
--fields-terminated-by "\t" \
--export-dir "/user/hive/warehouse/tmp.db/app_cource_study_analysis_${day}" \
--staging-table app_cource_study_report_tmp \
--clear-staging-table \
--input-null-string '\N'
```

     2) Scenario 2: set the number of map tasks to 1 (not recommended).

     Even when multiple map tasks are used, the --staging-table approach above still guarantees data consistency, so there is no need to give up parallelism.

(3) Exporting ADS-layer data stored as ORC or Parquet to MySQL

        During a Sqoop export, if the Hive table being exported is stored as ORC or Parquet, the job fails. The error reported by Sqoop itself is not very informative; the task log on YARN shows the real cause:

```
2020-04-22 11:24:47,814 FATAL [IPC Server handler 5 on 43129] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1586702868362_6282_m_000003_0 - exited : java.io.IOException: Can't export data, please check failed map task logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:122)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.RuntimeException: Can't parse input data: '�aҩ;����%�G8��}�_yd@rd�yd...'
    at appv_phm_switch_master_min_orc.__loadFromFields(appv_phm_switch_master_min_orc.java:1345)
    at appv_phm_switch_master_min_orc.parse(appv_phm_switch_master_min_orc.java:1178)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:89)
    ... 10 more
Caused by: java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:854)
    at appv_phm_switch_master_min_orc.__loadFromFields(appv_phm_switch_master_min_orc.java:1230)
    ... 12 more
```

As the stack trace shows, Sqoop cannot parse files stored in ORC format.

Solutions:

(1) Use the Sqoop-HCatalog integration, which lets Sqoop read ORC tables through HCatalog (more setup work); a sketch follows.
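
A minimal sketch of the HCatalog route, assuming the ORC table from the log above lives in a Hive database named `phmdwdb`; the MySQL table and connection details are placeholders. With `--hcatalog-table`, Sqoop reads the table through HCatalog instead of parsing the raw files, so the ORC storage format is no longer a problem (note that `--export-dir` cannot be combined with the HCatalog options):

```bash
sqoop export \
  --connect jdbc:mysql://mysql_host:3306/report_db \
  --username some_user \
  --password-file file:///home/etl/.pwd \
  --table appv_phm_switch_master_min \
  --hcatalog-database phmdwdb \
  --hcatalog-table appv_phm_switch_master_min_orc \
  -m 4
```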

(2) Conservative approach (recommended):

        Change the tmp table that is exported to MySQL back to the default TEXTFILE format. After the tmp table was rebuilt as TEXTFILE, the export succeeded.

For the DWS and ADS layers, build a dedicated tmp table for every table that Sqoop exports to MySQL. The tmp table matches the MySQL table exactly and exists only to interface with MySQL, so it keeps Hive's default storage strategy (TEXTFILE).
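
A minimal sketch of the conservative approach, rebuilding the export tmp table as TEXTFILE from the ORC table seen in the log above (the table names are assumptions for illustration):

```bash
# Recreate the tmp table used for the MySQL export in Hive's default TEXTFILE format,
# then run the usual sqoop export against its plain-text files.
hive -e "
DROP TABLE IF EXISTS phmdwdb.appv_phm_switch_master_min_tmp;
CREATE TABLE phmdwdb.appv_phm_switch_master_min_tmp
STORED AS TEXTFILE
AS SELECT * FROM phmdwdb.appv_phm_switch_master_min_orc;
"
```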

(4) Data skew problem

    If Sqoop's data-splitting strategy is poor, the map tasks receive very uneven amounts of data, i.e. data skew.

   The parallelism of a Sqoop extraction is mainly controlled by two parameters: --num-mappers (-m), which starts N map tasks to import data in parallel (4 by default), and --split-by, which names the table column used to split the work units.

  • To avoid data skew, the column given to --split-by should be of int type with evenly distributed values; only a small number of tables with auto-increment primary keys meet this requirement. The core idea is therefore to generate an ordered, evenly distributed auto-increment ID yourself and use it as the split axis, so that every map task gets a uniform share of the data; throughput can then be raised by increasing the number of maps.

Suggestions:
Use 4 maps when the data volume is below 5 million rows.
Above 5 million rows, 8 maps are usually enough; more than that puts pressure on the source database and degrades the performance of other workloads.
For a special one-off extraction, the parallelism can be raised appropriately, together with the parallelism of the downstream computation.


Typical scenario: normally you point --split-by at an auto-increment ID column and set the number of map tasks (the extraction concurrency) with --num-mappers or -m. In practice, however, many tables have no auto-increment ID or integer primary key, or their primary key is unevenly distributed, which slows the whole job down.

  • Following the design of the Sqoop source code, we can use the --query option to generate an auto-increment ID on the fly and pass that column to --split-by; at the same time we can fix the split boundaries with --boundary-query, using the known range of the generated ID.

The core syntax is as follows:


```bash
# --query approach: involves the parameters --query, --split-by and --boundary-query
#
# --query:          select col1, col2, ..., colN
#                   from (select ROW_NUMBER() OVER() AS INC_ID, T.* from table T where xxx)
#                   where $CONDITIONS
# --split-by:       INC_ID
# --boundary-query: select 1 as MIN, sum(1) as MAX from table where xxx
```

A complete example follows. The password file is generated with `echo -n "password content" > password-file` so that it contains no stray characters (such as a trailing newline); --password-file accepts a local file:// path or an HDFS path.

```bash
sqoop import --connect $yourJdbcConnectURL \
  --username $yourUserName \
  --password-file file:///localPasswordFile \
  --query "..." \
  --split-by "..." \
  -m 8 \
  --boundary-query "select 1 as min, sum(1) as max from table where xxx"
  # ... plus any other options
```

Reference: https://blog.csdn.net/qq_27882063/article/details/108352356

(5) Setting map task parallelism greater than 1

When importing data in parallel, Sqoop needs to know which column to split on. That column is usually a primary key or another non-repeating, increasing numeric field; otherwise the following error is reported:

Import failed: No primary key could be found for table. Please specify one with --split-by or perform a sequential import with '-m 1'.

In other words, when the map task parallelism is greater than 1, the following two parameters must be used together:

--split-by id    split the data on the id column

-m n             run n map tasks in parallel
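
A minimal sketch with both options set together (table, column and connection details are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://mysql_host:3306/some_db \
  --username some_user \
  --password-file file:///home/etl/.pwd \
  --table some_table \
  --split-by id \
  -m 4 \
  --target-dir /user/hive/warehouse/some_db.db/some_table
```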

2 Summary

This article summarizes the common problems encountered in Sqoop production and gives specific solutions.

 


Original post: https://blog.csdn.net/godlovedaniel/article/details/109200453