
The Insert Into statement in Doris is used much like INSERT INTO in databases such as MySQL. In Doris, however, every data write is an independent import job, so Insert Into is also covered here as an import method.

1. Syntax and parameters

The syntax of Insert Into is as follows:

INSERT INTO table_name
[ PARTITION (p1, ...) ]
[ WITH LABEL label]
[ (column [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }

The above syntax parameters are explained as follows:

  • table_name: The destination table for the imported data. Can be written in the form db_name.table_name.
  • partitions: Specify the partitions to be imported, which must exist in table_name. Multiple partition names are separated by commas.
  • label: Specify a label for the Insert task.
  • column_name: The specified destination column must be a column that exists in table_name.
  • expression: The corresponding expression that needs to be assigned to a column.
  • DEFAULT: Let the corresponding column use the default value.
  • query: A normal query, the results of the query will be written to the target.

The Insert Into command is submitted through the MySQL protocol, and creating an import request returns the import result synchronously. The main Insert Into commands come in the following two forms:

  • INSERT INTO tbl SELECT ...
  • INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1, 3, ...);

2. Case

Next, create table tbl1 to demonstrate the Insert Into operation.

# Create table tbl1
CREATE TABLE IF NOT EXISTS example_db.tbl1
(
`user_id` BIGINT NOT NULL COMMENT "user id",
`date` DATE NOT NULL COMMENT "date",
`username` VARCHAR(32) NOT NULL COMMENT "user name",
`age` BIGINT NOT NULL COMMENT "age",
`score` BIGINT NOT NULL DEFAULT "0" COMMENT "score"
)
DUPLICATE KEY(`user_id`)
PARTITION BY RANGE(`date`)
(
PARTITION `p1` VALUES [("2023-01-01"),("2023-02-01")),
PARTITION `p2` VALUES [("2023-02-01"),("2023-03-01")),
PARTITION `p3` VALUES [("2023-03-01"),("2023-04-01"))
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

# Insert data into the table via Insert Into
mysql> insert into example_db.tbl1 values  (1,"2023-01-01","zs",18,100), (2,"2023-02-01","ls",19,200);
Query OK, 2 rows affected (0.09 sec)
{'label':'insert_1b2ba205dee54110_b7a9c0e53b866215', 'status':'VISIBLE', 'txnId':'6015'}

# Create table tbl2 with the same schema as tbl1; the data is copied over as well
mysql> create table tbl2  as select * from tbl1;
Query OK, 2 rows affected (0.43 sec)
{'label':'insert_fad2b6e787fa451a_90ba76071950c3ae', 'status':'VISIBLE', 'txnId':'6016'}

# Insert data into table tbl2 using the Insert Into ... Select form
mysql> insert into tbl2 select * from tbl1;
Query OK, 2 rows affected (0.18 sec)
{'label':'insert_7a52e9f60f7b454b_a9807cd2281932dc', 'status':'VISIBLE', 'txnId':'6017'}

# Insert Into can also specify a Label to identify the import job
mysql> insert into example_db.tbl2 with label mylabel values (3,"2023-03-01","ww",20,300),(4,"2023-03-01","ml",21,400);
Query OK, 2 rows affected (0.11 sec)
{'label':'mylabel', 'status':'VISIBLE', 'txnId':'6018'}

# Query the data in table tbl2
mysql> select * from tbl2;
+---------+------------+----------+------+-------+
| user_id | date       | username | age  | score |
+---------+------------+----------+------+-------+
|       1 | 2023-01-01 | zs       |   18 |   100 |
|       1 | 2023-01-01 | zs       |   18 |   100 |
|       4 | 2023-03-01 | ml       |   21 |   400 |
|       2 | 2023-02-01 | ls       |   19 |   200 |
|       2 | 2023-02-01 | ls       |   19 |   200 |
|       3 | 2023-03-01 | ww       |   20 |   300 |
+---------+------------+----------+------+-------+
6 rows in set (0.12 sec)

Insert Into is itself a SQL command, and its return result falls into two situations depending on the execution outcome: the result set is empty, or the result set is not empty.

When the result set is empty, "Query OK, 0 rows affected" is returned. When the result set is not empty, the outcome is either import success or import failure. If the import fails, the corresponding error is returned directly. If the import succeeds, a JSON-like string containing fields such as "label", "status" and "txnId" is returned. For example:

{'label':'my_label1', 'status':'visible', 'txnId':'4005'}

{'label':'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}

{'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'}
  1. label: a user-specified label or an automatically generated one. The label identifies this Insert Into import job and is unique within a single database.
  2. status: indicates whether the imported data is visible. If visible, "visible" is shown; if not yet visible, "committed" is shown. Invisibility is a temporary state; the batch of data will eventually become visible.
  3. txnId: the id of the import transaction corresponding to this insert.
  4. err: displays other unexpected errors.
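The returned string is not strict JSON (it uses single quotes), but it happens to be a valid Python dict literal, so a client script can parse it with the standard library alone. A minimal sketch, where the helper name parse_insert_result is our own and not part of Doris:

```python
import ast

def parse_insert_result(result_line: str) -> dict:
    """Parse the {'label': ..., 'status': ..., 'txnId': ...} string
    returned by a successful Insert Into statement."""
    # Single-quoted keys/values make this a Python dict literal rather
    # than strict JSON; ast.literal_eval evaluates it safely.
    return ast.literal_eval(result_line)

info = parse_insert_result("{'label':'my_label1', 'status':'visible', 'txnId':'4005'}")
print(info["label"], info["status"])  # my_label1 visible
```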

By default, when an INSERT statement is executed, data that does not conform to the target table schema, such as an over-long string, causes the whole statement to fail, because the session variable enable_insert_strict defaults to true. Keeping it true is recommended: it guarantees that the INSERT will not report success while any data is being filtered out. For business scenarios that can tolerate filtering, set it to false with set enable_insert_strict=false;. The insert then returns success as long as at least one row is imported correctly, and the erroneous rows are automatically filtered out rather than inserted into the table. The filtered rows can be inspected through the "SHOW LOAD" statement, as the following examples demonstrate.

# Insert a data set containing erroneous rows into table tbl1; an error is returned
mysql> insert into example_db.tbl1 values (3,"2023-03-01","wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww",20,300),(4,"2023-03-01","ml",21,400);
ERROR 5025 (HY000): Insert has filtered data in strict mode, tracking_url=http://192.168.179.6:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_34684048e4234210-b0c4a99c9aabcb20_34684048e4234210_b0c4a99c9aabcb20

# Set enable_insert_strict to false
set enable_insert_strict=false;

# Insert the data set containing erroneous rows into table tbl1
mysql> insert into example_db.tbl1 values (3,"2023-03-01","wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww",20,300),(4,"2023-03-01","ml",21,400);
Query OK, 1 row affected, 1 warning (0.18 sec)
{'label':'insert_43d97ba2ec544fde_b4339d3f1c93753c', 'status':'VISIBLE', 'txnId':'7010'}

# Use show load to get the URL for the filtered data
mysql> show load\G;
*************************** 1. row ***************************
         JobId: 21007
         Label: insert_43d97ba2ec544fde_b4339d3f1c93753c
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: INSERT
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-02-10 20:47:06
  EtlStartTime: 2023-02-10 20:47:06
 EtlFinishTime: 2023-02-10 20:47:06
 LoadStartTime: 2023-02-10 20:47:06
LoadFinishTime: 2023-02-10 20:47:06
           URL: http://192.168.179.7:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_43d97ba2ec544fde-b4339d3f1c93753d_43d97ba2ec544fde_b4339d3f1c93753d
    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 7010
  ErrorTablets: {}

# Execute SHOW LOAD WARNINGS ON "url" to view the filtered data
mysql> SHOW LOAD WARNINGS ON "http://192.168.179.7:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_43d97ba2ec544fde-b4339d3f1c93753d_43d97ba2ec544fde_b4339d3f1c93753d"\G;
*************************** 1. row ***************************
         JobId: -1
         Label: NULL
ErrorMsgDetail: Reason: column_name[username], the length of input is too long than schema. first 32 bytes of input str: [wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww] schema length: 32; actual length: 36; . src line [];
1 row in set (0.01 sec)
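When automating imports, the tracking URL in the error message (or in the URL column of show load) can be pulled out with a simple regular expression. A sketch under the assumption that the URL is the only whitespace-free http(s) token in the message; the helper name extract_tracking_url and the shortened example URL are our own:

```python
import re
from typing import Optional

def extract_tracking_url(message: str) -> Optional[str]:
    """Pull the error-log URL out of an Insert Into error message
    or a 'show load' URL column."""
    m = re.search(r"https?://\S+", message)
    return m.group(0) if m else None

err = ("ERROR 5025 (HY000): Insert has filtered data in strict mode, "
       "tracking_url=http://192.168.179.6:8040/api/_load_error_log?file=__shard_0/xxx")
print(extract_tracking_url(err))
```

The extracted URL can then be fetched (for example with an HTTP client) or fed to SHOW LOAD WARNINGS ON "url".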

3. Things to note

3.1. About the amount of inserted data

Insert Into has no hard limit on data volume and can also be used to import large amounts of data. However, Insert Into has a default timeout, so if the estimated volume of imported data is large, the system's Insert Into timeout needs to be increased. The import time can be estimated as follows:

Suppose 36 GB of data needs to be imported into Doris and the cluster's import speed is 10 MB/s (10 MB/s is the maximum speed limit; the actual average import speed of the current cluster can be calculated from the previously imported data volume divided by the seconds consumed). The estimated import time is then 36 GB * 1024 MB/GB / (10 MB/s) ≈ 3686 seconds.
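This estimate can be expressed as a one-line helper; the function name and the GB/MB units are our own illustration:

```python
def estimate_import_seconds(data_gb: float, speed_mb_per_s: float) -> float:
    """Estimated import time: size converted from GB to MB, divided
    by the cluster's average import speed in MB/s."""
    return data_gb * 1024 / speed_mb_per_s

# 36 GB at 10 MB/s: about 3686 seconds, i.e. just over an hour
print(round(estimate_import_seconds(36, 10)))  # 3686
```

Since 3686 seconds exceeds the default 1-hour (3600 s) Insert Into timeout discussed in section 3.3, an import of this size would require raising the timeout.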

3.2. About the result returned by insert operation

  • If the return result is ERROR 1064 (HY000), the import failed.
  • If the return result is Query OK, the execution succeeded.
    1. If rows affected is 0, the result set is empty and no data was imported.
    2. If rows affected is greater than 0:
      1. If status is committed, the data is not yet visible. Check the status with the show transaction statement until it becomes visible.
      2. If status is visible, the data import succeeded.
      3. If warnings is greater than 0, some data was filtered. Use the show load statement to get the URL for viewing the filtered rows.
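The decision steps above can be sketched as a small client-side helper; the function name and outcome strings are our own, not a Doris API:

```python
def classify_insert_result(rows_affected: int, status: str = "", warnings: int = 0) -> str:
    """Map the fields of a 'Query OK' response to a coarse outcome,
    following the rules listed above."""
    if rows_affected == 0:
        return "empty"    # result set was empty, nothing imported
    if status.lower() == "committed":
        return "pending"  # not yet visible; poll 'show transaction'
    if warnings > 0:
        return "partial"  # some rows filtered; check 'show load'
    return "success"      # status 'visible', all rows imported

print(classify_insert_result(2, "VISIBLE"))             # success
print(classify_insert_result(1, "VISIBLE", warnings=1)) # partial
```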

3.3. About import task timeout

The import task timeout is measured in seconds. If an import task does not complete within the timeout, it is canceled by the system and its state becomes CANCELLED. Currently, Insert Into does not support customizing the timeout per import job; all Insert Into imports share a unified timeout, with a default of 1 hour. If the source data cannot be imported within this time, adjust the FE parameter insert_load_default_timeout_second.

At the same time, the Insert Into statement is restricted by the Session variable query_timeout. You can increase the timeout by SET query_timeout = xxx;, the unit is seconds.

3.4 About Session variables

  • enable_insert_strict

Insert Into import itself cannot control the error rate tolerated by the import; this can only be controlled through the enable_insert_strict session variable. When this variable is set to false, the statement returns success as long as at least one row is imported correctly, and a Label is returned even if some rows failed.

When this variable is set to true (the default), the import fails if any row of data is erroneous.

  • query_timeout

Insert Into is itself a SQL command, so the Insert Into statement is also limited by the session variable query_timeout. You can increase the timeout with SET query_timeout = xxx;, where the unit is seconds.

3.5 About data import errors

When a data import error occurs, you can view the error details with show load warnings on "url", where url is the URL from the returned error information.


Origin blog.csdn.net/eagle89/article/details/134858350