Building a Real-Time Data Analysis Platform with Flink + ClickHouse at Qutoutiao

Abstract: This article is shared by Wang Jinhai, head of the Qutoutiao (Fun Headlines) data platform. It introduces Qutoutiao's hour-level Flink-to-Hive scenario and second-level Flink-to-ClickHouse scenario. Author: Wang Jinhai; Source: Yunqi community

The content is divided into four parts:

  • Part 1: Business scenario and situation analysis
  • Part 2: The Flink-to-Hive hour-level scenario
  • Part 3: The Flink-to-ClickHouse second-level scenario
  • Part 4: Future development and thinking

Part 1: Business Scenario and Situation Analysis

Qutoutiao's report pages are divided into offline query pages and real-time query pages. In this year's refactoring, the real-time query pages were moved onto the ClickHouse compute engine. Depending on the business scenario, a real-time report shows both chart-style metric indicators and detailed data tables. Currently most metrics are collected and computed in five-minute windows, with special cases of one-minute or three-minute windows. All metric data comes from Kafka real-time streams and is imported into ClickHouse for computation.


Part 2: The Flink-to-Hive Hour-Level Scenario

1. Hour-level architecture

In this architecture, database binlogs are exported to Kafka, and Log Server data is also reported to Kafka. Once all the data has landed in Kafka in real time, Flink pulls it into HDFS. Note that Flink does not land data directly into Hive: it lands it on HDFS, and the HDFS data is attached to Hive at hour, half-hour, or even minute granularity. We therefore need to know how far the data's event time has progressed before triggering alter table, add partition, add location, and so on to register the partition.

A monitoring program is therefore needed to track how far in time the Flink task has consumed. For example, for the 9:00 data, we must see that Kafka consumption has passed 9:00 and the data has landed before triggering the partition write in Hive, as in the sketch below.
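
As a hedged illustration (the talk does not show the trigger's code), such a trigger can be as simple as executing an ALTER TABLE against HiveServer2 once the job's consumed event time passes the hour boundary; the table name, partition layout, path, and connection details below are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch: once the monitor sees that the Flink job's event time
// has passed 09:00 and the 09:00 files are closed on HDFS, register the
// Hive partition so offline queries can see it.
public class HivePartitionCommitter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server:10000/default", "flink", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "ALTER TABLE app.event_log ADD IF NOT EXISTS "
              + "PARTITION (dt='20200401', hour='09') "
              + "LOCATION 'hdfs://data/warehouse/event_log/20200401/09'");
        }
    }
}
```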


2. The principle

Qutoutiao mainly relies on a feature of newer Flink versions: StreamingFileSink. StreamingFileSink provides several capabilities:

  • First, forBulkFormat supports Avro and Parquet, i.e., columnar storage formats.
  • Second, withBucketAssigner customizes how data is bucketed by time; here an event-time bucket assigner is defined so that data lands according to the time it carries, matching the offline data.
  • Third, OnCheckPointRollingPolicy rolls files on checkpoints, so data lands and stabilizes within the checkpoint interval. Besides rolling on checkpoints, there are other strategies, such as rolling by data size.
  • Fourth, StreamingFileSink implements Exactly-Once semantics.

Two components in Flink implement Exactly-Once semantics: the first is the Kafka connector, the second is StreamingFileSink. The demo below shows a StreamingFileSink designed with OnCheckPointRollingPolicy to land a file to HDFS every 10 minutes.
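
A minimal sketch of such a sink, assuming Flink 1.9+, an Avro GenericRecord stream, and a custom event-time bucket assigner; the schema, paths, and bucket layout are placeholders:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Buckets each record by its event time so files land in the same
// dt/hour layout the offline Hive partitions expect.
class EventTimeBucketAssigner implements BucketAssigner<GenericRecord, String> {
    @Override
    public String getBucketId(GenericRecord element, Context context) {
        // context.timestamp() is the record's event timestamp in millis
        return new SimpleDateFormat("'dt='yyyyMMdd'/hour='HH")
                .format(new Date(context.timestamp()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

public class HdfsLandingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats roll on every checkpoint (OnCheckpointRollingPolicy),
        // so a 10-minute checkpoint interval lands a file every 10 minutes.
        env.enableCheckpointing(10 * 60 * 1000);

        Schema schema = new Schema.Parser().parse(EVENT_SCHEMA_JSON);
        StreamingFileSink<GenericRecord> sink = StreamingFileSink
                .forBulkFormat(new Path("hdfs://stream/data/event_log"),
                               ParquetAvroWriters.forGenericRecord(schema))
                .withBucketAssigner(new EventTimeBucketAssigner())
                .build();

        // Kafka source omitted: env.addSource(...).addSink(sink);
        env.execute("flink-to-hdfs-hourly");
    }

    // placeholder for the real Avro schema of the event log
    static final String EVENT_SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\","
      + "\"fields\":[{\"name\":\"uid\",\"type\":\"string\"}]}";
}
```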


■ How Exactly-Once is achieved

Consider a simple two-phase commit (2PC) model: the Coordinator sends a prepare, the participants start executing and trigger an ack, and once the Coordinator has received all the acks it sends a commit, upon which all participants commit. Translated into the Flink model, the Source receives a checkpoint barrier in the stream and begins triggering a snapshot.

Each operator completes its snapshot at checkpoint time, and once the checkpoint is complete the JobManager sends notifyCheckpointComplete. The steps of the two-phase model line up with this Flink model, which is why Flink can implement the two-phase commit protocol.


■ How Flink implements the two-phase commit protocol

First, StreamingFileSink implements two interfaces, CheckpointedFunction and CheckpointListener. CheckpointedFunction provides the initializeState and snapshotState methods; CheckpointListener provides notifyCheckpointComplete. With these two interfaces, two-phase-commit semantics can be implemented.

  • initializeState

initializeState triggers three actions when the task starts. The first is commitPendingFile. Data landing on HDFS in real time passes through three states: the first is in-progress (currently being written), the second is pending, and the third is finished.

initializeState also triggers restoreInProgressFile at task start, so the operator can resume real-time writes. If the program fails after a checkpoint had completed successfully, then on restart initializeState will commit the pending files and use the truncate capability of Hadoop 2.7+ to reset or truncate the in-progress file.

  • invoke

Real-time data is written.

  • snapshotState

At checkpoint time, snapshotState moves the in-progress file into the pending state and records the data length (the truncate approach needs the valid length to cut back to). snapshotState is not where data is actually written to HDFS; it writes to ListState. Flink achieves Exactly-Once internally through barrier alignment of state, but Exactly-Once at the external end is hard to achieve. Here it is achieved through ListState: the data is held in ListState, and only after all operators have completed the checkpoint is the data in ListState flushed to HDFS.

  • notifyCheckpointComplete

notifyCheckpointComplete triggers the write of pending data into the finished state. The implementation is a rename: the stream keeps writing temporary files to HDFS, and after all the actions have finished the official file is produced by a rename. The sketch below ties the four hooks together.
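
A heavily simplified sketch (not StreamingFileSink's actual code) of how a sink gets two-phase-commit behavior out of CheckpointedFunction and CheckpointListener; the file-handling helpers are hypothetical:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Simplified two-phase-commit sink modeled on StreamingFileSink's hooks.
// writeInProgress / moveToPending / commitPending / truncateInProgress
// are hypothetical helpers standing in for the real file handling.
public class TwoPhaseFileSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private transient ListState<String> pendingFiles; // phase-1 bookkeeping

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        pendingFiles = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("pending-files", String.class));
        if (ctx.isRestored()) {
            // commit files the last successful checkpoint left pending, then
            // truncate the in-progress file back to its recorded valid length
            for (String f : pendingFiles.get()) { commitPending(f); }
            truncateInProgress();
        }
    }

    @Override
    public void invoke(String value, Context ctx) throws Exception {
        writeInProgress(value); // normal real-time writes
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // phase 1 (prepare): in-progress -> pending, record the valid length
        pendingFiles.add(moveToPending());
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // phase 2 (commit): pending -> finished, via rename
        for (String f : pendingFiles.get()) { commitPending(f); }
        pendingFiles.clear();
    }

    private void writeInProgress(String v)  { /* hypothetical */ }
    private String moveToPending()          { return "part-0-0"; } // hypothetical
    private void commitPending(String f)    { /* rename to finished; hypothetical */ }
    private void truncateInProgress()       { /* hypothetical */ }
}
```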


3. Cross-cluster multi-nameservices

Qutoutiao's real-time and offline clusters are independent of each other: there are multiple offline clusters, and currently one real-time cluster. Writing from the real-time cluster to the offline clusters raises an HDFS nameservices problem. Taking all the offline clusters' namenode HA nameservices and merging them wholesale into the real-time cluster is not appropriate. So how can tasks on the real-time cluster submit to each offline cluster?

As shown below, an hdfs-site.xml is added to the Flink task's resources. In it, properties for the nameservices are added: for example, stream is the real-time cluster's namenode HA configuration, and data is the namenode HA configuration of the offline cluster to be written. The HDFS settings of the two clusters themselves need no modification at all; this can be implemented entirely on the client side.
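
A hedged sketch of what that client-side hdfs-site.xml can look like; the hostnames are placeholders, with stream standing for the real-time cluster and data for the offline one:

```xml
<configuration>
  <!-- expose both clusters' nameservices to the Flink client -->
  <property>
    <name>dfs.nameservices</name>
    <value>stream,data</value>
  </property>

  <!-- real-time cluster: namenode HA pair -->
  <property>
    <name>dfs.ha.namenodes.stream</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.stream.nn1</name>
    <value>stream-nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.stream.nn2</name>
    <value>stream-nn2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.stream</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

  <!-- offline cluster to be written: same pattern under "data" -->
  <property>
    <name>dfs.ha.namenodes.data</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.data.nn1</name>
    <value>data-nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.data.nn2</name>
    <value>data-nn2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.data</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

Paths like hdfs://stream/... and hdfs://data/... then resolve on the client without touching either cluster's own configuration.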


4. Multi-user write permissions

Writing from the real-time side to offline HDFS may involve user permission issues. All real-time jobs are submitted as one and the same defined user, while the offline side is multi-user across its programs, so the real-time and offline users do not match. Qutoutiao added withBucketUser to its API for writing to HDFS. Once the nameservices are configured, one then only needs to specify which user writes to which HDFS path, for example configuring a stream user for a given write.

The benefit of doing this at the API level is that one Flink program can write to several different HDFS clusters as several different users. Multi-user writing is implemented with a Hadoop FileSystem ugi.doAs user proxy, sketched below. The above covers some of Qutoutiao's work on synchronizing real-time data into Hive with Flink. A small-files problem can appear there; small files are merged periodically by a daemon, and if the checkpoint interval is short, for example 3 minutes, a large number of small files will accumulate.
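
A minimal sketch of the doAs proxy approach, assuming the submitting user is allowed to impersonate the bucket user; the user name and path are placeholders:

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // impersonate the offline-side owner of the target path
        UserGroupInformation ugi = UserGroupInformation.createProxyUser(
                "stream", UserGroupInformation.getLoginUser());
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out =
                     fs.create(new Path("hdfs://data/warehouse/event_log/_probe"))) {
                out.writeUTF("written as user stream");
            }
            return null;
        });
    }
}
```

Note that impersonation also requires the offline cluster to allow it via the hadoop.proxyuser.<submitter>.hosts/groups settings.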


Part 3: The Flink-to-ClickHouse Second-Level Scenario

1. Second-level architecture

Qutoutiao has a great many real-time metrics, computed on average every five or three minutes. If every real-time metric were written as its own Flink task or Flink SQL job, each consuming a Kafka topic to compute daily active users, new users, and so on, then every time a user raises a new requirement, the current Flink task has to be changed or a new Flink task started to consume the topic.

This creates the problem of endlessly modifying Flink tasks or endlessly starting new ones. Qutoutiao therefore tried connecting Flink to ClickHouse to handle OLAP as a whole. The second-level architecture runs from Kafka to Flink, then to Hive and the ClickHouse cluster, which externally serves Horizon (real-time reports), QE (real-time ad-hoc queries), Chihiro (data analysis), and user profiling (real-time audience segmentation).


2. Why Flink + ClickHouse

  • Metrics can be described in SQL: the metrics analysts ask for can basically all be described in SQL.
  • Metrics go online and offline independently: one Flink task consumes the topic; if further metrics are needed, they can be brought online or taken offline without affecting one another.
  • Data can be traced back, making anomaly investigation easy: if, say, daily active users drop, we need to trace back and find out whether it is a question of metric definition, a discrepancy in the reported data, data lost from the Kafka stream, or users failing to report; Flink on its own cannot backtrack.
  • Fast computation, with all metrics finished within one cycle: hundreds of metrics across all dimensions must all be computed within the five-minute window.
  • Supports real-time streaming and distributed deployment, with simple operation and maintenance: supports real-time streaming data from Kafka.

Qutoutiao's ClickHouse cluster currently has 100+ machines, each with 32 cores, 128 GB of RAM, and 3.5 TB of SSD; the daily data volume is 200+ billion rows, with 210k+ queries per day, 80% of which complete within 1s. Single-table tests show ClickHouse's single-table speed to be very fast. But, constrained by its architecture, ClickHouse is weak at joins.


A relatively complex SQL query of the form count + group by + order by completes over 2.6 billion rows of data in ClickHouse in 3.6s.
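
The exact query is not reproduced in the text; an illustrative query of the same shape (table and columns hypothetical, matching the DDL sketch in the next subsection) would be:

```sql
-- count + group by + order by over the distributed table
SELECT dt, channel, count() AS pv
FROM app.event_all
WHERE dt = '2020-04-01'
GROUP BY dt, channel
ORDER BY pv DESC
LIMIT 100;
```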


3. Why ClickHouse is so fast

First, ClickHouse uses columnar storage with LZ4 and ZSTD data compression. Second, storage is co-located with computation, combined with vectorized execution. Presto may pull data stored on a Hadoop cluster's HDFS to compute over it; ClickHouse's co-location of storage and computation means that, with the data on each machine's local SSD, every machine only needs to compute over its own data and then merge the results at one node. Third, LSM merge tree + indexes: after data is written to ClickHouse, a background thread merges the data and builds indexes, such as the primary index and DT/hour-level indexes, to improve query performance. Fourth, SIMD + LLVM optimizations; SIMD is single instruction, multiple data. Fifth, complete SQL syntax and UDFs; Qutoutiao has great demand for these, since data analysis and higher-dimension attribute pulls need some time-window functions.
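
As a hedged sketch of what such tables can look like (cluster name, table names, and columns are hypothetical), here is a Local MergeTree table partitioned by dt/hour with a sort-key index, plus the Distributed table over it, as also discussed in the connector notes below:

```sql
-- Local table on every node: MergeTree, partitioned by day + hour;
-- ORDER BY defines the sparse primary index built during background merges
CREATE TABLE app.event_local ON CLUSTER report_cluster
(
    dt        Date,
    hour      UInt8,
    channel   String,
    device_id String,
    event     String,
    ts        DateTime
)
ENGINE = MergeTree()
PARTITION BY (dt, hour)
ORDER BY (channel, event, ts);

-- Distributed table used for reads: fans queries out to all Local tables
CREATE TABLE app.event_all ON CLUSTER report_cluster
AS app.event_local
ENGINE = Distributed(report_cluster, app, event_local, rand());
```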

  • Merge Tree: the first layer receives the real-time writes; the background then merges the data level by level, sorting it and building the index during each merge.
  • ClickHouse Connector: ClickHouse has two table concepts, Local tables and Distributed tables (see the DDL sketch above). Generally one writes to Local tables and reads from Distributed tables. ClickHouse is generally written in batches of 50k~100k rows on a 5s cycle. Qutoutiao also implemented a RoundRobinClickHouseDataSource.
  • BalancedClickHouseDataSource: with MySQL, configuring one IP and port is enough to write data, but BalancedClickHouseDataSource has to write the Local tables, so it must know how many Local tables the cluster has and the IP and port of each one. With one hundred machines, all one hundred IPs and ports must be configured before writing. BalancedClickHouseDataSource has two schedules, scheduleActualization and scheduleConnectionsCleaning. With one hundred machines configured, some machines will fail to connect or their service will not respond; scheduleActualization periodically discovers the machines that cannot be connected and triggers actions such as taking them offline or removing their IPs, while scheduleConnectionsCleaning periodically cleans up useless http connections to ClickHouse.


  • RoundRobinClickHouseDataSource: Qutoutiao's strengthening of BalancedClickHouseDataSource, implementing three semantics. testOnBorrow is set to true, so obtaining a connection first tries a ping; since writes to ClickHouse are batched, testOnReturn is set to false and testWhileIdle to true, complementing the official scheduleActualization and scheduleConnectionsCleaning functions. ClickHouse merges continuously in the background, and if the background merge cannot keep up with the inserts, errors occur. So writes should not hammer one machine continuously: wait until the current machine's batch is finished, then write to the next machine, with a 5s interval between writes, so that the merge speed stays as consistent as possible with the insert speed. A sketch of such a batched write follows.
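
A hedged sketch of the batched Local-table write described above, using the official clickhouse-jdbc BalancedClickhouseDataSource (as the class is actually spelled in the driver); the host list, table, columns, and batch source are placeholders, and the round-robin/backoff logic is simplified:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.TimeUnit;

import ru.yandex.clickhouse.BalancedClickhouseDataSource;

public class ClickHouseBatchWriter {
    public static void main(String[] args) throws Exception {
        // every Local table's host:port must be listed explicitly
        BalancedClickhouseDataSource ds = new BalancedClickhouseDataSource(
                "jdbc:clickhouse://ch1:8123,ch2:8123,ch3:8123/app");
        // periodically drop unreachable replicas from the rotation
        ds.scheduleActualization(10, TimeUnit.SECONDS);

        List<String[]> batch = fetchBatch(); // ~50k-100k rows (hypothetical source)
        try (Connection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO app.event_local (dt, channel, device_id) VALUES (?, ?, ?)")) {
            for (String[] row : batch) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        Thread.sleep(5000); // ~5s between batches so background merges keep up
    }

    static List<String[]> fetchBatch() { return java.util.Collections.emptyList(); }
}
```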


4. Backfill

When importing from Flink into ClickHouse, data queries and reports can run into problems: the Flink task fails, errors out, or suffers backpressure; or the ClickHouse cluster stops responding, ZooKeeper cannot keep up, inserts come too fast, or the cluster is overloaded. Any of these can affect the whole pipeline.

If traffic suddenly surges, a restarted Flink job may spend a long period continuously catching up on data, requiring operations such as raising Flink's parallelism to help it recover. But by then a backlog has formed, and if Flink increases its concurrency to process the backlog, ClickHouse's limit on insert speed gets in the way, leading to a vicious cycle. So when Flink fails or the ClickHouse cluster malfunctions, the approach is: wait for the ClickHouse cluster to recover, restart the Flink task consuming from the latest data instead of chasing the missed period, and import that period's data into ClickHouse from Hive.

Because the Kafka data is also landed into Hive in real time, data can be written into ClickHouse from Hive. ClickHouse has partitions: simply delete the affected hour's data and import that hour's data from Hive, and queries can continue, as sketched below. Backfill thus provides hour-level fault tolerance for Flink tasks and hour-level fault tolerance for the ClickHouse cluster.
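
A hedged sketch of the hour-level repair, reusing the hypothetical tables from the DDL sketch above; how the Hive rows are re-imported (e.g. via a Spark or Hive batch job) is not specified in the talk:

```sql
-- drop the bad hour's partition on every node ...
ALTER TABLE app.event_local ON CLUSTER report_cluster
    DROP PARTITION ('2020-04-01', 9);

-- ... then re-insert that hour from the copy landed in Hive,
-- e.g. through an external batch job writing into app.event_local
```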


Part 4: Future Development and Thinking

1. SQL-izing the connectors

Currently, Qutoutiao's Flink-to-Hive and Flink-to-ClickHouse paths are fairly well solidified scenarios: the user only needs to specify the HDFS path, and the rest of the process can be described in SQL.

2. Delta Lake

Flink is a unified batch-stream compute engine, but there is no unified batch-stream storage. Qutoutiao uses KV stores such as HBase, Kudu, and Redis to interact with Flink in real time. Taking new-user computation as an example, the current approach is to flush Hive's historical users into Redis or HBase for Flink to interact with in real time and determine whether a user is new.

But this means the data is stored twice, in Hive and in Redis. Also, extracting data from Binlog involves delete operations; HBase and Kudu support data modification and are periodically written back to Hive. The resulting problem is that HBase and Kudu hold the data while Hive saves another copy, so there are one or more extra copies of the data. If there were a unified batch-stream storage supporting the above scenarios, then when a Flink task runs it could interact with offline data in real time, including querying Hive data in real time to determine whether a user is new, and modifying, updating, or deleting Hive data in real time; the storage would also support Hive's batch operations.

Going forward, Qutoutiao will consider using Flink to build this unified batch-stream storage, so that the Flink ecosystem unifies stream and batch.

Author: Wang Jinhai, 10 years of Internet experience. At Vipshop he was responsible for the user-profiling system, providing personalized audience marketing services; at Ele.me he served as an architect, responsible for big-data task scheduling, metadata development, and profiling work; he is now the person in charge of Qutoutiao's data center platform, responsible for the big-data base compute layer (Spark, Presto, Flink, ClickHouse), the platform service layer (Libra real-time computing, Kepler offline scheduling), the data product layer (QE ad-hoc queries, Horizon data reports), as well as metadata and data-permissions team building.



Origin: blog.csdn.net/zl1zl2zl3/article/details/105326825