Real-time data warehouse construction, Question 2: How to use Flink SQL to quickly and effortlessly count the number of the day's orders in each process status (shipped, receipt confirmed, etc.)

The goal: count, in real time, the number of the day's orders in each process status (paid and awaiting shipment by the seller, seller has notified logistics for pickup, waiting for the buyer to confirm receipt, etc.).
The binlog of the order table is sent to Kafka, and Flink consumes the messages from Kafka to compute the metrics. The status of each order changes over time: in the morning an order may be in [Paid, awaiting shipment by the seller], so that metric needs a +1; in the afternoon the same order changes to [Seller has notified logistics for pickup], so that metric needs a +1 while [Paid, awaiting shipment by the seller] needs a -1.
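For concreteness, the afternoon status change would arrive as a single binlog UPDATE message; in the canal-json encoding discussed below it might look like this (database, table, and field values are illustrative, not from the original article):

{
  "database": "shop",
  "table":    "order",
  "type":     "UPDATE",
  "data":     [{ "order_no": "1001", "order_status": "seller_notified_logistics" }],
  "old":      [{ "order_status": "paid_awaiting_shipment" }]
}

One message carries both the new row (data) and the previous values of the changed columns (old), which is exactly the information needed to derive both the -1 and the +1.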

If you implement this in Java code, you need a deep understanding of the business and have to handle every single state transition; miss one branch of if logic and the results will be wrong. But with Flink SQL, do you really not need to think about any of this business logic? For the query

select order_status, count(order_no) from `order` group by order_status

to produce the desired result, the stream entering the SQL must be an update/retract stream rather than an append stream; otherwise you have to write the subtraction logic yourself, based on the update/delete type of each binlog message.

Solutions

  • Don't have Flink consume from Kafka at all: use the flink-cdc connector to read the database binlog directly.
  • If Flink does consume from Kafka, declare 'format' = 'canal-json' in the Kafka source table DDL (see the sketch after this list).
  • If the current Flink version does not support the canal-json format, you have to convert the append stream from the source into an update/retract stream yourself before it enters the aggregation operator.
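A minimal sketch of the first two options (host names, topic, columns, and credentials are assumptions for illustration, not from the original article):

-- Option 1: read the MySQL binlog directly with the flink-cdc connector
CREATE TABLE orders_cdc (
    order_no     STRING,
    order_status STRING,
    PRIMARY KEY (order_no) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'flink',
    'password'      = '******',
    'database-name' = 'shop',
    'table-name'    = 'order'
);

-- Option 2: read the binlog that canal has already written to Kafka
CREATE TABLE orders_kafka (
    order_no     STRING,
    order_status STRING
) WITH (
    'connector' = 'kafka',
    'topic'     = 'order_binlog',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id'          = 'order_status_count',
    'scan.startup.mode'            = 'earliest-offset',
    'format'    = 'canal-json'
);

-- The same aggregation works on either source; the changelog
-- semantics take care of the +1 / -1 automatically
select order_status, count(order_no) from orders_kafka group by order_status;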

As long as the source produces changelog data, downstream operators handle the update messages automatically. You can think of it as:

  • insert / update_after messages are added to the aggregate metrics
  • delete / update_before messages are retracted from the aggregate metrics
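Traced through one order (order number and statuses are illustrative), using Flink's changelog row kinds (+I = insert, -U = update_before, +U = update_after, -D = delete):

-- Order 1001 is created as [paid], then updated to [shipped]:
--   +I (order_no=1001, order_status='paid')     =>  count for 'paid'    +1
--   -U (order_no=1001, order_status='paid')     =>  count for 'paid'    -1
--   +U (order_no=1001, order_status='shipped')  =>  count for 'shipped' +1
-- Net result: paid = 0, shipped = 1, with no hand-written if/else logic.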

Purpose of this column:

  • If you want to quickly build a real-time data warehouse whose layers align with your offline warehouse, Flink SQL is the first choice. Compared with DataStream code, Flink SQL can cut the implementation time of a real-time data warehouse by roughly 10x.
  • The author works on the real-time data warehouse team of a large tech company, currently running 3000+ real-time jobs on a cluster of 20,000 CUs, with a peak checkpoint size of 5 TB and a peak single-job QPS of 500,000 (50W).
  • This column shares the details the author has run into while building a real-time data warehouse, to help you build one quickly.

Origin: blog.csdn.net/hbly979222969/article/details/131151314