Dada Group real-time computing task SQL practice

About the author: Ma Yangyang, senior development engineer on the Dada Group data platform, responsible for the maintenance and development of Dada Group's computing engines.

This article introduces Dada Group's practical experience in SQL-izing real-time computing tasks with Dada Flink SQL, an engine developed on top of the open-source Flink Stream SQL.

01

Background

Back in 2018, thanks to the joint efforts of the data platform and the data team, we already had a complete offline computing pipeline and a well-established offline data warehouse model, and we had launched many data products and a large number of data reports. As the business grew, we gradually faced more and more real-time computing requirements, and with Flink becoming increasingly popular in China, real-time computing also entered our field of vision. At that time, Flink's SQL functionality was not yet complete, and much of what large-scale data development requires could not be expressed in SQL, so developing real-time tasks directly against Flink's framework and APIs was difficult for our data developers. Like many other companies, we therefore decided to build a layer of encapsulation so that data developers would not need to write Java or Scala code and could focus on business logic. Because development resources were limited, we preferred to do this by introducing an open-source framework and customizing it. After some research, we narrowed the candidates down to Kangaroo Cloud's Flink Stream SQL (hereinafter FSL) and Uber's AthenaX. After comparison, FSL's rich plug-ins, development activity, and relatively complete support were more attractive to us, so we adopted Kangaroo Cloud's FSL, developed the Dada Flink SQL (hereinafter DFL) engine on top of it, and used it to SQL-ize our real-time computing tasks.

02

Architecture

Let us first introduce the architecture of DFL. The main components of DFL are the launcher, the core, the source plug-ins, the sink plug-ins, the Flink Siddhi plug-in, and the side plug-ins. Among them, Flink Siddhi is a Siddhi-based rule engine that we integrated following the open-source flink-siddhi project; a later article will cover Flink Siddhi and the wrapping we built around it. The launcher is responsible for loading the necessary source/side/sink plug-ins and submitting the Flink program to the Flink cluster, supporting both session cluster mode and single-job mode. The core module is responsible for parsing SQL statements, generating the SqlTree, loading the corresponding plug-ins based on the parsed source, sink, Flink Siddhi, and side content, and generating the necessary components and registering them into the Flink TableEnvironment. After that, depending on whether the SQL uses the dimension table JOIN feature, it either calls TableEnvironment.sqlUpdate() directly or handles the dimension table JOIN. Besides the dimension table JOIN, we also added support for INTERVAL JOIN at the request of our data developers. The overall flow of DFL is shown in the figure below.

[Figure: overall flow of DFL]

2.1 Parser

DFL uses a Parser to parse SQL statements into the corresponding data structures and puts them into a SqlTree for management and later use. The Parser defines a clean interface, so it is easy to add support for new SQL syntax by adding new implementation classes. The Parser interface is defined as follows:

[Code screenshot: definition of the IParser interface]

Among them, match determines whether a specific Parser implementation can parse a given SQL statement; verifySyntax is an interface method we added to check whether the syntax of a given SQL statement is correct and to put the related error information into errorInfo for the caller to use; parserSql performs the actual parsing of the SQL statement. We have added many implementations of IParser to support new features, such as Flink Siddhi.
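The interface itself is only shown as a screenshot above. Based on the description, a minimal Java sketch of what IParser might look like is given below; the method names come from the text, while the signatures and the SqlTree parameter are assumptions.

```java
// Sketch only: method names are taken from the article, signatures are assumed.
public interface IParser {

    // Whether this Parser implementation can handle the given SQL statement.
    boolean match(String sql);

    // Check whether the syntax of the given SQL statement is correct; on failure,
    // append the details to errorInfo for the caller to use.
    boolean verifySyntax(String sql, StringBuilder errorInfo);

    // Perform the actual parsing and register the result into the SqlTree
    // (DFL's container for parsed statements, see section 2.1).
    void parserSql(String sql, SqlTree sqlTree);
}
```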

2.2 Dimension table JOIN

There are two ways to implement a dimension table JOIN in DFL: ALL and SIDE. The ALL mode reads and caches all the data needed for the JOIN into the task's memory in one go, and the cache can be configured to refresh periodically; the SIDE mode reads the corresponding data from the underlying data source when the JOIN needs it and, depending on the configuration, decides whether to cache the data it reads in memory. The abstract classes corresponding to the ALL and SIDE modes are AllReqRow and AsyncReqRow respectively. Both implement the common interface ISideReqRow, which defines the method used to join a fact-table row with the data read from the dimension table: Row fillData(Row input, Object sideInput). The definitions of AllReqRow and AsyncReqRow are as follows:

[Code screenshot: definitions of AllReqRow and AsyncReqRow]

You can see that the template method design pattern is used.

[Code screenshot: template-method implementation classes]

AsyncSideReqRow mainly provides default handling for initializing the LRU cache, fetching data from the LRU cache, and dealing with the case where the data required for the JOIN cannot be found in either the data source or the LRU cache.
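The class definitions above are only available as screenshots; the Java sketch below merely illustrates the structure described in the text (the base classes and method signatures are assumptions modelled on the idea of FSL's side plug-ins, not the actual DFL code).

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.types.Row;

// Common interface: join a fact-table row with the data read from the dimension table.
interface ISideReqRow {
    Row fillData(Row input, Object sideInput);
}

// ALL mode: load the whole dimension table into the task's memory up front,
// optionally refreshing the cache on a fixed interval.
abstract class AllReqRow extends RichFlatMapFunction<Row, Row> implements ISideReqRow {
    // Template method: a concrete source (MySQL, HBase, ...) implements the full load.
    protected abstract void reloadCache();
}

// SIDE mode: look up the data source per record, optionally caching results in an LRU cache.
abstract class AsyncReqRow extends RichAsyncFunction<Row, Row> implements ISideReqRow {
    // Concrete sources implement asyncInvoke() to perform the asynchronous lookup.
}
```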

03

Added functions and improvements

While developing DFL, driven both by business requirements and by the need to make DFL easier for data developers to use, we made many improvements and extensions on top of native FSL. Here is some of the work we have done on DFL.

3.1 Task submission timeouts in SESSION mode under Flink HA

To give Flink tasks better fault tolerance, we configured ZooKeeper-based HA for our Flink clusters. For ease of task management and maintenance, some of our Flink tasks use session mode. After migrating these tasks to DFL, submitting them reported a timeout error, and checking Flink's official documentation turned up no clues. After some digging we found that in YARN session mode, when HA is configured, high-availability.cluster-id needs to be specified when submitting the task. After adding the following code, tasks could be submitted normally in SESSION mode.

[Code screenshot: specifying high-availability.cluster-id when submitting to the session cluster]
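The original code is only shown as a screenshot; the snippet below is a sketch of the idea, assuming the YARN application id of the running session is used as the cluster id (the exact value DFL uses and where it is set are not shown in the article).

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.configuration.HighAvailabilityOptions;

public class SessionSubmitConfig {

    // Point the submitting client at the HA cluster id of the running YARN session;
    // without this, submission in SESSION mode times out when ZooKeeper HA is enabled.
    public static Configuration withClusterId(String flinkConfDir, String yarnApplicationId) {
        Configuration flinkConfig = GlobalConfiguration.loadConfiguration(flinkConfDir);
        flinkConfig.setString(HighAvailabilityOptions.HA_CLUSTER_ID, yarnApplicationId);
        return flinkConfig;
    }
}
```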

3.2 Supporting SQL keywords as JSON field names for Kafka sources

When an SQL keyword is used as a field name in Flink, the following error is still reported even if the field name is wrapped in backticks:

[Screenshot: error reported when an SQL keyword is used as a field name]

This is a Flink bug that was fixed in 1.10.1; see this issue for details: https://issues.apache.org/jira/browse/FLINK-16526. The version we use is Flink 1.6.2, so the fix is not available to us. Our approach is to decouple the JSON field name in Kafka from the column name that references it: Flink SQL refers to the JSON field through the specified column name, while the original JSON field name is used for JSON parsing. Specifically, in the metadata system we support registering an optional sourceName for Kafka-type tables; if a sourceName is registered, Flink Stream SQL uses the sourceName to parse the corresponding field in the JSON.
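As an illustration of the idea (the field, column, and class names below are hypothetical, not DFL's actual implementation): the JSON is parsed using the original field name registered as sourceName, while the SQL side only ever sees the substitute column name.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.types.Row;

public class SourceNameExample {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Column 0 is referenced in SQL as `binlog_table`, but the raw JSON field is
    // called "table" (an SQL keyword), so parsing reads it via the registered sourceName.
    public static Row parse(String json) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        Row row = new Row(1);
        row.setField(0, node.get("table").asText());
        return row;
    }
}
```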

3.3 Metadata integration

After DFL went online, with the necessary features added, pure-SQL development satisfied many of our real-time task development needs. However, after running DFL for a while, we noticed how much trouble managing the information of the various upstream and downstream storage systems caused our data developers. The storage systems we use online include Kafka, HBase, Elasticsearch, Redis, and MySQL (ClickHouse was introduced later). These data sources are largely heterogeneous, each with its own connection and user information, and the same data source is used in different tasks, yet every time the CREATE TABLE <table_name> () WITH () syntax had to be used to spell out the field information and repeat the connection information. In response to this problem, and inspired by Hive metadata, we decided to develop our own real-time metadata management system to manage these real-time data sources. The architecture of our metadata management system is shown below.

[Figure: architecture of the metadata management system]

After the metadata management system was developed, we deeply integrated Flink Stream SQL with it. By introducing the USE TABLE <> AS <> WITH () syntax, our data developers only need to register a data source in the metadata management system and can then reference the registered table in Flink Stream SQL without filling in any connection information; if all fields are to be referenced, the field information does not need to be filled in either. If you do not want to reference all fields, there are two ways to do it: the first is to use columns inside USE TABLE ... WITH () to list the fields that need to be referenced; the second is to register a table in the metadata system that contains only the fields to be referenced.
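A purely hypothetical example of such a statement is shown below; the table name, alias, and column list are made up, and the exact property syntax inside WITH () may differ from DFL's actual grammar.

```sql
-- Reference a Kafka table registered in the metadata system, selecting only three columns.
USE TABLE kafka_order_topic AS order_stream WITH (
    columns = 'order_id, city_id, create_time'
);
```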

3.4 Redis hash / set data type support

FSL has built-in support for Redis as a sink table and as a side table, but it only supports Redis String data, while our scenarios also use Redis hash and set data, so we needed to add support for more Redis data types. First, a quick introduction to how data in Redis is mapped to a table in Flink. Our Redis key consists of two parts separated by ":": a fixed keyPrefix and a primaryKey built by joining the values of one or more fields with ":". The keyPrefix simulates the concept of a table and also makes it easier to manage what is stored in Redis. For String data, the field name is appended to the key described above (again using ":" as the delimiter) and the field value is written to Redis as the value of that key. For hash data, the complete Redis key is exactly the key described above; the hash field is formed by joining the values of the user-specified fields with ":", and the hash value is likewise formed by joining the values of the user-specified fields. In addition to support for the Redis hash and set data types, we also added setnx, hsetnx, and TTL support for Redis.
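A small Java sketch of the key layout described above; the keyPrefix, primary-key values, and hash-key values used here are hypothetical examples.

```java
public class RedisKeyExample {

    // Base key: fixed keyPrefix plus the primary-key values, joined with ":".
    // e.g. baseKey("order", "123", "shanghai") -> "order:123:shanghai"
    static String baseKey(String keyPrefix, String... primaryKeyValues) {
        return keyPrefix + ":" + String.join(":", primaryKeyValues);
    }

    // String type: one Redis key per field, e.g. "order:123:shanghai:status" -> value.
    static String stringKey(String baseKey, String fieldName) {
        return baseKey + ":" + fieldName;
    }

    // Hash type: the Redis key is the base key; the hash field (and similarly the
    // hash value) is built by joining the values of the user-specified fields with ":".
    static String hashField(String... hashKeyValues) {
        return String.join(":", hashKeyValues);
    }
}
```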

3.5 ClickHouse sink support

FSL has built-in support for writing to data sources such as Kafka, MySQL, Redis, Elasticsearch, and HBase as target tables, but during use we also ran into new data sources that needed to serve as write targets, so we developed new sink plug-ins to support them. The sink plug-ins we develop and maintain include ClickHouse and HdfsFile. Let's take the ClickHouse sink as an example to introduce some of the work we have done in this area.

For ClickHouse, we developed a ClickhouseSink that extends RichSinkFunction and implements CheckpointedFunction. By implementing CheckpointedFunction and flushing data to ClickHouse in the snapshotState() method, we ensure that data will not be lost. To handle different input data types, we provide the ClickhouseMapper interface, which maps input data to org.apache.flink.types.Row. ClickhouseMapper is defined as follows.

[Code screenshot: definition of ClickhouseMapper]
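The definition is only available as a screenshot; a minimal sketch of what such a mapper interface might look like is given below (the generic parameter and method name are assumptions).

```java
import java.io.Serializable;

import org.apache.flink.types.Row;

public interface ClickhouseMapper<T> extends Serializable {

    // Map an arbitrary input record to a Flink Row matching the ClickHouse table schema.
    Row map(T input);
}
```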

Unlike the usual approach where the user provides the schema of the sink table, we obtain the table schema by issuing a DESC statement to ClickHouse. To handle ClickHouse-specific data types such as Nullable(String) and Int32, we use regular expressions to extract the actual type used for writing; the relevant code is as follows.

[Code screenshot: extracting the actual ClickHouse column type with a regular expression]
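The actual code is only shown as a screenshot; the following is a sketch of the idea under the assumption that stripping the Nullable(...) wrapper is the main case to handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClickhouseTypeUtil {

    private static final Pattern NULLABLE = Pattern.compile("Nullable\\((.*)\\)");

    // "Nullable(String)" -> "String"; plain types such as "Int32" pass through unchanged.
    public static String actualType(String columnType) {
        Matcher m = NULLABLE.matcher(columnType.trim());
        return m.matches() ? m.group(1) : columnType;
    }
}
```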

To write data without blocking the normal data processing flow, we submit write tasks to a thread pool. At the same time, to avoid data loss when the Flink task fails, the snapshotState() method waits for the tasks in the thread pool to complete.
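Putting these pieces together, a simplified skeleton of such a sink might look as follows; the batch size, thread-pool setup, and the JDBC write itself are assumptions, and this is not DFL's actual ClickhouseSink.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClickhouseSinkSketch extends RichSinkFunction<Row> implements CheckpointedFunction {

    private transient ExecutorService flushPool;
    private transient List<Row> buffer;
    private transient List<CompletableFuture<Void>> pendingWrites;

    @Override
    public void open(Configuration parameters) {
        flushPool = Executors.newSingleThreadExecutor();
        buffer = new ArrayList<>();
        pendingWrites = new ArrayList<>();
    }

    @Override
    public void invoke(Row value, Context context) {
        buffer.add(value);
        if (buffer.size() >= 1000) {            // batch size is an assumption
            flushAsync();
        }
    }

    // Hand the current batch to the thread pool so writing does not block processing.
    private void flushAsync() {
        List<Row> batch = new ArrayList<>(buffer);
        buffer.clear();
        pendingWrites.add(CompletableFuture.runAsync(() -> writeToClickhouse(batch), flushPool));
    }

    private void writeToClickhouse(List<Row> batch) {
        // Issue the actual JDBC batch insert here.
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Flush the buffer and wait for in-flight writes, so a completed checkpoint
        // implies the buffered data has reached ClickHouse.
        flushAsync();
        pendingWrites.forEach(CompletableFuture::join);
        pendingWrites.clear();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // No managed state is needed for this sketch.
    }

    @Override
    public void close() {
        flushPool.shutdown();
    }
}
```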

3.6 Simplification of BINLOG expression

To process updates to online data, we adopted Alibaba's open-source Canal to collect MySQL binlog and send it to Kafka. Because of the special way binlog data is organized, processing it requires a lot of tedious work, such as using UDFs to extract the actually inserted or updated fields from the columnValues or updatedValues fields of the binlog. Since Flink Stream SQL is linked with the metadata system, we can obtain the schema information of MySQL tables, and therefore we can provide syntactic encapsulation to save data developers from this repetitive SQL. To this end, we introduced a new SQL syntax, USE BINLOG TABLE, whose format is as follows.

[Code screenshot: format of the USE BINLOG TABLE syntax]

We expand this syntax into the following content.

[Code screenshot: the SQL that the syntax expands into]

04

Application

After DFL went online, because tasks could be developed in pure SQL, which matches data developers' habits, and because we provide a lot of syntactic encapsulation plus the convenience of metadata management, data developers gradually migrated some real-time computing tasks to DFL, which brought a great efficiency improvement to the department. To date, DFL has been applied in various data application systems of Dada Group; more than 70 real-time computing tasks are running in the system, covering various business and traffic modules of Dada Express and JD Daojia, and both the number of real-time computing tasks and the proportion that are SQL-ized are still steadily increasing. As the big data department opens up its computing infrastructure, our real-time computing capabilities are being used more and more widely by other teams in the group.

05

Future plans

The community version of Flink has now reached 1.10, and Flink Table / SQL itself already supports most of the functionality provided by DFL. To reduce the complexity of maintaining our components, we plan to introduce Flink 1.10, gradually promote its use, and eventually migrate all tasks to the latest Flink version.

The company is gradually promoting the use of private clouds. Considering the community's progress on Flink on K8s, we will try to deploy on the company's private cloud when we introduce a new version of Flink.

