A preliminary study on DataGen, a powerful testing tool for Flink | JD Cloud technical team

What is Flink SQL

Flink SQL is built on Apache Calcite's SQL parser and optimizer, supports the ANSI SQL standard, and lets you process streaming and batch data with standard SQL statements. With Flink SQL, data processing logic can be described declaratively without writing explicit code. You can perform all kinds of data operations, such as filtering, aggregation, joins, and transformations, and it also provides window operations, time handling, and complex event processing to cover the needs of stream processing.
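
For example, a windowed aggregation can be expressed directly in SQL. The query below is only an illustrative sketch: it assumes a hypothetical orders table with a buyer field, an amount field, and an event-time attribute order_time (none of these are defined elsewhere in this article), and it counts and sums orders per buyer in one-minute tumbling windows.

-- Illustrative sketch only: orders, buyer, amount and order_time are assumed names
SELECT
 buyer,
 TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
 COUNT(*) AS order_cnt,
 SUM(amount) AS total_amount
FROM orders
GROUP BY
 buyer,
 TUMBLE(order_time, INTERVAL '1' MINUTE);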

Flink SQL provides many extended functions and syntax to adapt to the characteristics of Flink's streaming and batch processing engines. It is the highest level abstraction of Flink and can be seamlessly integrated with the DataStream API and DataSet API, taking advantage of Flink's distributed computing capabilities and fault tolerance mechanism.

Basic steps to use Flink SQL to process data:

  1. Define the input table: Use the CREATE TABLE statement to define the input table, specify the table schema (fields and types) and data source (such as Kafka, file, etc.).

  2. Execute SQL queries: Use SQL statements such as SELECT and INSERT INTO to perform data queries and operations. You can use various built-in functions, aggregation operations, window operations, time attributes, etc. in SQL queries.

  3. Define the output table: Use the CREATE TABLE statement to define the output table, specify the table schema and target data storage (such as Kafka, file, etc.).

  4. Submit the job: Submit the Flink SQL query as a Flink job to the Flink cluster for execution. Flink automatically builds an execution plan from the query logic and configuration and distributes the data processing tasks to the task managers in the cluster.

In summary, we can process streaming and batch data through Flink SQL queries and operations. It provides a way to simplify and accelerate data processing development, especially for developers and data engineers who are familiar with SQL.
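
As a minimal, self-contained sketch of these four steps (the table and field names are made up, and the built-in datagen and print connectors serve purely as stand-in source and sink):

-- Step 1: define the input table (datagen here is only a stand-in source)
CREATE TABLE source_table (
 id BIGINT,
 name STRING
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '10'
);

-- Step 3: define the output table (the built-in print sink writes rows to the task logs)
CREATE TABLE sink_table (
 id BIGINT,
 name STRING
) WITH (
 'connector' = 'print'
);

-- Steps 2 and 4: the query itself; submitting this INSERT runs it as a Flink job
INSERT INTO sink_table
SELECT id, UPPER(name) AS name FROM source_table;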

What is a connector

A Flink connector is the component used to connect to external systems and data sources. Through specific connectors, Flink can interact with different data sources such as databases, message queues, and file systems, handling tasks such as communicating with the external system, converting data formats, and reading and writing data. Whether used as an input table or an output table, data in an external system can be accessed and manipulated from Flink SQL with the appropriate connector. The real-time platform currently provides many commonly used connectors:

For example:

  1. JDBC: used to establish connections with relational databases (such as MySQL, PostgreSQL), and supports reading and writing data from database tables in Flink SQL.

  2. JDQ: Used to integrate with JDQ and can read and write data in JDQ topics.

  3. Elasticsearch: For integration with Elasticsearch, data can be written to or read from the Elasticsearch index.

  4. File Connector: Used to read and write data in various file formats (such as CSV, JSON, Parquet).

  5. ......

There are also HBase, JMQ4, Doris, Clickhouse, Jimdb, Hive, etc., used to integrate with different data sources. By using Flink SQL connectors, we can easily interact with external systems, import data into Flink for processing, or export processing results to external systems.
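
For example, a table backed by the JDBC connector might be declared as follows. This is only a sketch: the database URL, table name, and credentials below are placeholders, not values from the real-time platform.

-- Placeholder values only: url, table-name, username and password must be adapted
CREATE TABLE mysql_orders (
 order_number BIGINT,
 price DECIMAL(10, 2),
 buyer STRING,
 PRIMARY KEY (order_number) NOT ENFORCED
) WITH (
 'connector' = 'jdbc',
 'url' = 'jdbc:mysql://localhost:3306/shop',
 'table-name' = 'orders',
 'username' = 'user',
 'password' = 'password'
);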

DataGen Connector

DataGen is a built-in connector provided by Flink SQL for generating simulated test data for use during development and testing.

Using DataGen, you can generate data of different types and distributions, such as integers, strings, and dates. This simulates real data scenarios and helps verify and debug Flink SQL queries and operations.

Demo

The following is a simple example using the DataGen connector:

-- Create the input table
CREATE TABLE input_table (
 order_number BIGINT,
 price DECIMAL(32,2),
 buyer ROW<first_name STRING, last_name STRING>,
 order_time TIMESTAMP(3)
) WITH (
 'connector' = 'datagen'
);

In the example above, we created an input table named `input_table` with the DataGen connector. The table contains four fields: `order_number`, `price`, `buyer`, and `order_time`. By default, each field is generated with the random strategy for its type, at a rate of 10,000 rows per second, and data keeps being produced as long as the task is running. You can also specify parameters to control how data is generated, such as the number of rows produced per second and each field's generation strategy and value range.

Generated data sample:

{"order_number":-6353089831284155505,"price":253422671148527900374700392448,"buyer":{"first_name":"6e4df4455bed12c8ad74f03471e5d8e3141d7977bcc5bef88a57102dac71ac9a9dbef00f406ce9bddaf3741f37330e5fb9d2","last_name":"d7d8a39e063fbd2beac91c791dc1024e2b1f0857b85990fbb5c4eac32445951aad0a2bcffd3a29b2a08b057a0b31aa689ed7"},"order_time":"2023-09-21 06:22:29.618"}
{"order_number":1102733628546646982,"price":628524591222898424803263250432,"buyer":{"first_name":"4738f237436b70c80e504b95f0d9ec3d7c01c8745edf21495f17bb4d7044b4950943014f26b5d7fdaed10db37a632849b96c","last_name":"7f9dbdbed581b687989665b97c09dec1a617c830c048446bf31c746898e1bccfe21a5969ee174a1d69845be7163b5e375a09"},"order_time":"2023-09-21 06:23:01.69"}

Supported types

Field type                 Data generation method
BOOLEAN                    random
CHAR                       random / sequence
VARCHAR                    random / sequence
STRING                     random / sequence
DECIMAL                    random / sequence
TINYINT                    random / sequence
SMALLINT                   random / sequence
INT                        random / sequence
BIGINT                     random / sequence
FLOAT                      random / sequence
DOUBLE                     random / sequence
DATE                       random
TIME                       random
TIMESTAMP                  random
TIMESTAMP_LTZ              random
INTERVAL YEAR TO MONTH     random
INTERVAL DAY TO MONTH      random
ROW                        random
ARRAY                      random
MAP                        random
MULTISET                   random

Connector properties

  • connector (required, no default, String): must be set to 'datagen'.
  • rows-per-second (optional, default 10000, Long): rate at which rows are generated.
  • number-of-rows (optional, no default, Long): total number of rows to produce; if not set, generation is unbounded.
  • fields.#.kind (optional, default random, String): generation strategy for field #, either random or sequence.
  • fields.#.min (optional, defaults to the minimum of the field's type, same type as the field): minimum value produced by the random generator; numeric types only.
  • fields.#.max (optional, defaults to the maximum of the field's type, same type as the field): maximum value produced by the random generator; numeric types only.
  • fields.#.length (optional, default 100, Integer): length of generated char/varchar/string/array/map/multiset values.
  • fields.#.start (optional, no default, same type as the field): start value of the sequence generator.
  • fields.#.end (optional, no default, same type as the field): end value of the sequence generator.
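
As a small usage sketch of the options above (the table and field names are made up): a bounded source that emits exactly 1,000 rows, with id generated as a sequence from 1 to 1000 and name as a random 10-character string.

-- Illustrative sketch: bounded datagen source combining sequence and random fields
CREATE TABLE bounded_source (
 id BIGINT,
 name STRING
) WITH (
 'connector' = 'datagen',
 'number-of-rows' = '1000',
 'fields.id.kind' = 'sequence',
 'fields.id.start' = '1',
 'fields.id.end' = '1000',
 'fields.name.kind' = 'random',
 'fields.name.length' = '10'
);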

Using DataGen

Now that we understand the basic usage of DataGen, let's put it into practice together with other types of connectors.

Scenario 1: Generate 100 million rows into a Hive table

CREATE TABLE dataGenSourceTable
 (
 order_number BIGINT,
 price DECIMAL(10, 2),
 buyer STRING,
 order_time TIMESTAMP(3)
 )
WITH
 ( 'connector'='datagen', 
 'number-of-rows'='100000000',
 'rows-per-second' = '100000'
 ) ;


CREATE CATALOG myhive
WITH (
 'type'='hive',
 'default-database'='default'
);
USE CATALOG myhive;
USE dev;
SET table.sql-dialect=hive;
CREATE TABLE IF NOT EXISTS shipu3_test_0932 (
 order_number BIGINT,
 price DECIMAL(10, 2),
 buyer STRING,
 order_time TIMESTAMP(3)
) PARTITIONED BY (dt STRING) STORED AS parquet TBLPROPERTIES (
 'partition.time-extractor.timestamp-pattern'='$dt',
 'sink.partition-commit.trigger'='partition-time',
 'sink.partition-commit.delay'='1 h',
 'sink.partition-commit.policy.kind'='metastore,success-file'
);
SET table.sql-dialect=default;
insert into myhive.dev.shipu3_test_0932
select order_number,price,buyer,order_time, cast( CURRENT_DATE as varchar)
from default_catalog.default_database.dataGenSourceTable;

At 100,000 rows per second, producing 100 million rows takes about 17 minutes. Of course, we can finish large-volume data production faster by adding compute nodes to the Flink job, increasing the parallelism, raising the 'rows-per-second' production rate, and so on.
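
For reference, these knobs can be adjusted roughly as follows in the Flink SQL client (the quoted SET syntax applies to newer client versions); the values below are examples only, not tuned recommendations.

-- Example values only: raise the default job parallelism before submitting the INSERT
SET 'parallelism.default' = '8';
-- and raise the generation rate in the datagen source DDL, e.g.
-- 'rows-per-second' = '500000'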

Scenario 2: Continuously produce 100,000 messages per second to a message queue

CREATE TABLE dataGenSourceTable (
 order_number BIGINT,
 price INT,
 buyer ROW< first_name STRING, last_name STRING >,
 order_time TIMESTAMP(3),
 col_array ARRAY < STRING >,
 col_map map < STRING, STRING >
 )
WITH
 ( 'connector'='datagen', -- connector type
 'rows-per-second'='100000', -- generation rate
 'fields.order_number.kind'='random', -- generation strategy for order_number
 'fields.order_number.min'='1', -- minimum value of order_number
 'fields.order_number.max'='1000', -- maximum value of order_number
 'fields.price.kind'='sequence', -- generation strategy for price
 'fields.price.start'='1', -- start value of price
 'fields.price.end'='1000', -- end value of price
 'fields.col_array.element.length'='5', -- length of each array element
 'fields.col_map.key.length'='5', -- length of each map key
 'fields.col_map.value.length'='5' -- length of each map value
 ) ;
CREATE TABLE jdqsink1
 (
 order_number BIGINT,
 price DECIMAL(32, 2),
 buyer ROW< first_name STRING, last_name STRING >,
 order_time TIMESTAMP(3),
 col_ARRAY ARRAY < STRING >,
 col_map map < STRING, STRING >
 )
WITH
 (
 'connector'='jdq',
 'topic'='jrdw-fk-area_info__1',
 'jdq.client.id'='xxxxx',
 'jdq.password'='xxxxxxx',
 'jdq.domain'='db.test.group.com',
 'format'='json'
 ) ;
INSERT INTO jdqsink1
SELECT * FROM dataGenSourceTable;

Thoughts

As the cases above show, DataGen combined with other connectors can simulate data for all kinds of scenarios.

  • Performance testing: we can use Flink's high processing throughput to tune the thresholds of a task's external dependencies (timeouts, rate limits, etc.) to a suitable level, avoiding a weakest-link effect where too many external dependencies hold the task back;
  • Boundary condition testing: we can use Flink DataGen to generate special test data, such as minimum values, maximum values, null values, and repeated values, to verify the correctness and robustness of a Flink task under boundary conditions (see the sketch after this list);
  • Data integrity testing: we can use Flink DataGen to generate data sets containing incorrect or abnormal data, such as invalid formats, missing fields, and duplicate records, to test the Flink task's ability to handle abnormal input and to verify whether it correctly maintains data integrity.
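
A hedged sketch of boundary-value generation (the table and field names are illustrative): pinning a random field's min and max to the same extreme value produces a constant, while a sequence field sweeps an exact range including both endpoints.

-- Illustrative sketch only: boundary-value test data with datagen
CREATE TABLE boundary_source (
 order_number BIGINT,
 seq_id INT
) WITH (
 'connector' = 'datagen',
 'number-of-rows' = '1000',
 'fields.order_number.kind' = 'random',
 'fields.order_number.min' = '9223372036854775807', -- pinned to the BIGINT maximum
 'fields.order_number.max' = '9223372036854775807',
 'fields.seq_id.kind' = 'sequence',
 'fields.seq_id.start' = '1',
 'fields.seq_id.end' = '1000'
);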

In short, Flink DataGen is a powerful tool that helps testers construct all kinds of test data. Used well, it lets testers work more efficiently and uncover potential problems and defects.

Author: JD Retail Shi Pu

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
