Practical exploration of developing FlinkSQL tasks based on the Kangaroo Cloud real-time development platform

As businesses grow, real-time scenarios are becoming increasingly important across industries. Whether in finance, e-commerce, or logistics, real-time data processing has become a critical link. With its powerful stream processing capabilities, window operations, and support for a wide range of data sources, Flink has become the tool of choice for real-time development.

FlinkSQL makes data development more approachable through the SQL language, but its development style still differs significantly from offline SparkSQL development. Kangaroo Cloud's real-time development platform, StreamWorks, is committed to lowering the barrier to FlinkSQL development, enabling more data developers to master real-time development skills and helping popularize real-time computing applications.

This article briefly introduces four ways to develop FlinkSQL tasks on the Kangaroo Cloud real-time development platform.

Script Mode

This is the most basic development method. Data developers write FlinkSQL code in the platform IDE to complete both Flink table definitions and business logic. The code is shown below:

-- Define the source table
CREATE TABLE server_logs (
    client_ip STRING,
    client_identity STRING,
    userid STRING,
    user_agent STRING,
    log_time TIMESTAMP(3),
    request_line STRING,
    status_code STRING,
    size INT
) WITH (
    'connector' = 'faker',
    'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
    'fields.client_identity.expression' = '-',
    'fields.userid.expression' = '-',
    'fields.user_agent.expression' = '#{Internet.userAgentAny}',
    'fields.log_time.expression' = '#{date.past ''15'',''5'',''SECONDS''}',
    'fields.request_line.expression' = '#{regexify ''(GET|POST|PUT|PATCH){1}''} #{regexify ''(/search\.html|/login\.html|/prod\.html|/cart\.html|/order\.html){1}''} HTTP/1.1',
    'fields.status_code.expression' = '#{regexify ''(200|201|204|400|401|403|301){1}''}',
    'fields.size.expression' = '#{number.numberBetween ''100'',''10000000''}'
);

-- Define the sink table; in practice, Kafka, JDBC, etc. would be used as the sink
CREATE TABLE client_errors (
    log_time TIMESTAMP(3),
    request_line STRING,
    status_code STRING,
    size INT
) WITH (
    'connector' = 'stream-x'
);

-- Write data into the sink table
INSERT INTO client_errors
SELECT
    log_time,
    request_line,
    status_code,
    size
FROM server_logs
WHERE status_code SIMILAR TO '4[0-9][0-9]';

Pros and Cons of Script Mode

Advantages: high flexibility.

Disadvantages: the Flink table definition logic is complex. Developers unfamiliar with a data source plug-in struggle to remember which parameters must be maintained, and when a task involves many tables, the long stretch of table definition code makes it hard to trace and troubleshoot the business logic.

Wizard Mode

To address the shortcomings of script mode, the Kangaroo Cloud real-time development platform abstracts Flink table definition into a visual configuration feature that guides data developers through table definitions via page configuration, letting development focus on the business logic itself.


In wizard mode, source tables, dimension tables, and result tables are mapped to Flink tables through guided configuration on the development page; the IDE then references these Flink tables directly for reading and writing, so only the business logic remains to be developed.

· The platform provides common configuration items for source, dimension, and result tables of various data sources by default;

· For advanced parameters, the platform also provides a key/value method for maintaining custom parameters to meet flexibility requirements.
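With the table mappings handled by the wizard, the SQL left in the IDE reduces to business logic alone. A minimal sketch of what remains (the table names server_logs and error_stats stand for hypothetical wizard-mapped tables, not platform defaults):

```sql
-- With mapping done in the wizard, the IDE only needs the business logic;
-- no CREATE TABLE ... WITH (...) boilerplate appears here
INSERT INTO error_stats
SELECT status_code, COUNT(*) AS error_cnt
FROM server_logs
GROUP BY status_code;
```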

Catalog Mode

Wizard mode lets us complete table mappings quickly through configuration, but it has a limitation: the mapped tables can only be referenced in the current task and cannot be reused across tasks.

In real-time data warehouse construction, however, we often hit the following scenario: a DWS-level Kafka topic serves as the source table for multiple ADS tasks, and each ADS task has to repeat the same Flink mapping for the same DWS topic.

To eliminate this repeated mapping work, we can use Flink's Catalog feature to persist the metadata of mapped tables so they can be referenced repeatedly across tasks. The usage is as follows (taking the platform's DT Catalog as an example):

Catalog maintenance

· First, create a catalog named stream_warehouse under DT Catalog;

· Then create databases under that catalog by data warehouse layer or business domain.
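Outside the platform UI, the same structure can be sketched in plain Flink SQL. The catalog type below is illustrative only: the built-in in-memory catalog stands in for DT Catalog, which is configured through the platform and, unlike this sketch, actually persists metadata:

```sql
-- Illustrative only: 'generic_in_memory' stands in for the platform's DT Catalog
CREATE CATALOG stream_warehouse WITH ('type' = 'generic_in_memory');
USE CATALOG stream_warehouse;

-- One database per warehouse layer or business domain
CREATE DATABASE dws;
CREATE DATABASE ads_db;
```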


Flink mapping table creation

Method 1: hover over a database in the directory tree and complete the Flink table mapping through guided configuration.


Method 2: in the IDE, create the table with a CREATE DDL statement, taking care to specify the full catalog.database path:

CREATE TABLE stream_warehouse.dws.orders (
    order_uid  BIGINT,
    product_id BIGINT,
    price      DECIMAL(32, 2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'datagen'
);

FlinkSQL task development

After the two steps above, a Flink mapping table with persisted metadata has been created. When developing a task, we can reference the tables we need directly via catalog.database.table:

INSERT INTO stream_warehouse.ads_db.client_errors
SELECT
    log_time,
    request_line,
    status_code,
    size
FROM stream_warehouse.dws_db.server_logs;

Demo Mode

If, after learning the three development methods above, you are still unfamiliar with FlinkSQL development logic, we recommend completing a full task development through the code template center of the Kangaroo Cloud real-time development platform.

The template center provides more than 20 common business scenarios with their corresponding FlinkSQL code, such as various window queries and joins. You can apply the templates to your real business scenarios to complete task development quickly.
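A typical windowed-aggregation template looks roughly like the sketch below: counting client errors per one-minute tumbling window over the server_logs table from the script-mode example. This is an illustration, not a verbatim platform template, and it assumes a watermark has been defined on log_time (the script-mode DDL above does not declare one):

```sql
-- Errors per 1-minute tumbling window, using Flink's windowing TVF;
-- assumes server_logs declares a watermark on log_time
SELECT
    window_start,
    window_end,
    COUNT(*) AS error_cnt
FROM TABLE(
    TUMBLE(TABLE server_logs, DESCRIPTOR(log_time), INTERVAL '1' MINUTE))
WHERE status_code SIMILAR TO '4[0-9][0-9]'
GROUP BY window_start, window_end;
```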


Summary

No development mode is absolutely better or worse. Enterprises should adopt different modes according to their real-time computing scenarios and stage of maturity to truly reduce costs and increase efficiency.

· When an enterprise is new to real-time computing and its data developers are unfamiliar with FlinkSQL, demo mode is the best choice;

· When an enterprise has started real-time computing but the task load is still small, script mode or wizard mode is a good choice;

· When an enterprise's real-time computing reaches a certain scale and requires management similar to an offline data warehouse, Catalog mode is the best choice.

"Dutstack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm If you want to know or consult more about Kangaroo Cloud big data products, industry solutions, and customer cases, visit Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Students interested in big data open source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" (group number: 30537511) to exchange the latest open source technology news. Project address: https://github.com/DTStack
