Summary of Flink common interview questions

Question: Why use Flink instead of Spark?

Answer : The main considerations are Flink's low latency, high throughput, and better support for streaming application scenarios. In addition, Flink handles out-of-order data well and can guarantee exactly-once state consistency. For details, please refer to the first chapter of the document, which contains a detailed comparison between Flink and Spark.

Question: Where are Flink's checkpoints stored?

Answer : They can be stored in memory, in a file system such as HDFS, or in RocksDB, depending on the configured state backend.
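
As a hedged sketch with the Flink 1.13+ Java API (the HDFS path is a placeholder), the choice is made by configuring the state backend and the checkpoint storage:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);                        // checkpoint every 60 s
            env.setStateBackend(new EmbeddedRocksDBStateBackend()); // working state kept in RocksDB
            // Durable checkpoint storage; the HDFS path is a placeholder.
            env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
        }
    }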

Question: If the lower-level storage does not support transactions, how does Flink guarantee exactly-once?

Answer : End-to-end exactly-once places higher requirements on the sink, and the two main implementation approaches are idempotent writes and transactional writes. Whether an idempotent write is possible depends on the business logic, so transactional writes are the more common case, and they come in two flavors: write-ahead log (WAL) and two-phase commit (2PC).
If the external system does not support transactions, the write-ahead-log approach can be used: first buffer the result data as state, and then write it to the sink system in one batch when the checkpoint completes.
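
The following is a minimal sketch of that write-ahead-log idea, not Flink's built-in GenericWriteAheadSink: results are staged in operator state and only pushed to the external system after the enclosing checkpoint completes. The externalWrite call is a hypothetical stand-in for any non-transactional write, and import paths (e.g., for CheckpointListener) vary slightly between Flink versions:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.common.state.CheckpointListener;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class WalSink extends RichSinkFunction<String>
            implements CheckpointedFunction, CheckpointListener {

        private transient ListState<String> walState; // the WAL, snapshotted with each checkpoint
        private final List<String> buffer = new ArrayList<>();

        @Override
        public void invoke(String value, Context ctx) {
            buffer.add(value); // stage the record instead of writing immediately
        }

        @Override
        public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
            walState.update(buffer); // persist the staged records as state
        }

        @Override
        public void initializeState(FunctionInitializationContext ctx) throws Exception {
            walState = ctx.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("wal", String.class));
            if (ctx.isRestored()) {
                walState.get().forEach(buffer::add); // recover records not yet flushed
            }
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            buffer.forEach(this::externalWrite); // flush only after the checkpoint is durable
            buffer.clear();
        }

        private void externalWrite(String v) { /* hypothetical non-transactional write */ }
    }

Note that a failure during the flush can write some records twice, which is why the WAL approach is usually described as only approximately exactly-once; 2PC avoids this when the sink cooperates.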

Question: Tell me about Flink's state mechanism.

Answer : Many of Flink's built-in operators, including sources and sinks, are stateful. In Flink, state is always associated with a specific operator. Flink takes snapshots of each task's state in the form of checkpoints to guarantee state consistency during failure recovery. Flink manages the storage of state and checkpoints through the state backend, for which different options can be configured.
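
As a small, hedged example of keyed state (the type and the counting logic are illustrative, not from the original answer), a RichFlatMapFunction can keep a per-key counter in ValueState, which Flink snapshots automatically with each checkpoint:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> count; // one counter per key, managed by Flink

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();             // null on the first event for this key
            long next = (current == null ? 0L : current) + 1;
            count.update(next);
            out.collect(Tuple2.of(in.f0, next));
        }
    }

This only works on a keyed stream (after keyBy), since ValueState is scoped to the current key.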

Question: What are the differences and advantages of Flink's checkpoint mechanism compared to Spark's?

Answer : Spark Streaming's checkpoint is merely a persistence of data and metadata used for recovering a failed driver. Flink's checkpoint mechanism is considerably more elaborate: it uses lightweight, barrier-based distributed snapshots (in the spirit of the Chandy-Lamport algorithm) to snapshot the state of each operator together with its position in the stream.

Question: Please explain in detail Flink's Watermark mechanism.

Answer : A watermark is essentially Flink's mechanism for measuring the progress of event time, and it is mainly used to handle out-of-order data. A watermark carrying timestamp t flows through the stream as a special record and asserts that no more events with timestamps at or before t are expected, which allows event-time windows and timers to fire safely.
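
A hedged sketch using the WatermarkStrategy API introduced in Flink 1.11 (the Event type and its timestamp field are assumptions); it tolerates events that arrive up to five seconds out of order:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // events is an existing DataStream<Event>; Event.timestampMillis is hypothetical.
    DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
            WatermarkStrategy
                    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, recordTs) -> event.timestampMillis));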

Question: How is exactly-once semantics implemented in Flink, and how is state stored?

Answer : Flink relies on the checkpoint mechanism to achieve exactly-once semantics for its internal state. To achieve end-to-end exactly-once, the external source must additionally be replayable and the sink must support idempotent or transactional writes. State storage is managed through the state backend, and different state backends can be configured in Flink.
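
A minimal, hedged configuration sketch (the interval value is arbitrary): checkpointing is enabled in EXACTLY_ONCE mode, which is the default barrier-alignment mode:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Take a checkpoint every 10 s with exactly-once barrier alignment (the default mode).
    env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);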

Question: In Flink CEP programming, where is the data stored while a pattern has not yet been fully matched?

Answer : In stream processing, CEP naturally has to support event time, and therefore late (out-of-order) data as well, which is handled with the watermark logic. CEP treats partially matched event sequences similarly to late data: in Flink CEP's processing logic, events that have not yet satisfied the pattern, as well as late events, are buffered in a map-like state structure. In other words, if we constrain the event sequence to a duration of 5 minutes, up to 5 minutes of data is held in state, which in my opinion is also one of the bigger sources of memory pressure.
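
A hedged Flink CEP sketch of such a time-bounded pattern (the Event type, its field names, and eventStream are assumptions); everything buffered for a partial match stays in state until the 5-minute window expires:

    import org.apache.flink.cep.CEP;
    import org.apache.flink.cep.PatternStream;
    import org.apache.flink.cep.pattern.Pattern;
    import org.apache.flink.cep.pattern.conditions.SimpleCondition;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Match a "login" followed by a "fail" within 5 minutes; Event is hypothetical.
    Pattern<Event, ?> pattern = Pattern.<Event>begin("login")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event e) { return "login".equals(e.type); }
            })
            .next("fail")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event e) { return "fail".equals(e.type); }
            })
            .within(Time.minutes(5)); // partial matches are buffered in state for up to 5 minutes

    // eventStream is an existing DataStream<Event>.
    PatternStream<Event> matches = CEP.pattern(eventStream, pattern);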

Question: What are the three time semantics of Flink, and what are the application scenarios?

Answer :

1. Event Time: the time at which an event actually occurred, carried inside the record. This is the most common time semantics in practical applications.

2. Processing Time: the wall-clock time of the machine executing the operator. It is used when the data carries no event time, or when latency requirements are extremely strict.

3. Ingestion Time: assigned when a record enters Flink. When there are multiple source operators, each source operator uses its own local system clock to assign the ingestion time, and all subsequent time-based operations use that timestamp in the data record. (A configuration sketch follows this list.)
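
As a hedged sketch: in Flink versions before 1.12 (where event time became the default and this setter was deprecated), the time semantics were selected explicitly on the environment:

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Pick one of EventTime, IngestionTime, or ProcessingTime.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);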

Question: How do Flink programs deal with data peak periods?

Answer : Put the data into a large-capacity message queue such as Kafka and use it as the data source, then let Flink consume from it at its own pace. The queue absorbs the peak, at the cost of a small amount of extra latency.
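
A hedged sketch of that setup with the classic FlinkKafkaConsumer connector (broker address, topic, and group id are placeholders; newer Flink versions use the KafkaSource builder instead):

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder brokers
    props.setProperty("group.id", "flink-consumer");      // placeholder group id

    // Kafka absorbs the peak; Flink drains the topic at its own rate.
    DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));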

Question: How to deduplicate? Consider a real-time scenario: a Double Eleven promotion, with a sliding window of length 1 hour sliding every 10 seconds, and 100 million users. How do you compute UV?

Answer : Using a set data structure such as a Scala Set or a Redis set is clearly infeasible, because there may be hundreds of millions of keys, which cannot all be held. So consider using a Bloom filter for deduplication.
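
A hedged sketch of the membership test using Guava's BloomFilter (the sizing numbers and userIdsInWindow are illustrative; in production the filter is often kept in Redis as a bitmap so it survives failures and can be shared across parallel subtasks):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // Sized for ~1e8 users with a 1% false-positive rate; false positives
    // make UV a slight undercount, never an overcount.
    BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000_000L, 0.01);

    long uv = 0;
    for (String userId : userIdsInWindow) { // userIdsInWindow: placeholder iterable of ids
        if (!seen.mightContain(userId)) {
            seen.put(userId);
            uv++; // each user counted at most once
        }
    }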

Question: How does your company submit real-time tasks, and how many JobManagers are there?

Answer :

  1. We submit tasks in yarn-session mode. Each submission creates a new Flink cluster, which provides a dedicated yarn-session for each job; tasks are independent of each other and do not interfere, which makes them easy to manage. The cluster created for a task also goes away once the task finishes. The command line looks like this (queue and application name are placeholders):

      ./bin/yarn-session.sh -n 7 -s 8 -jm 3072 -tm 32768 -qu root.<queue> -nm <name> -d

      Here the application gets 7 TaskManagers, each with 8 slots (cores) and 32768 MB of memory, while the JobManager gets 3072 MB; -qu selects the YARN queue, -nm names the application, and -d runs the session detached.
  2. The cluster has only one JobManager by default, but to avoid a single point of failure we configure high availability. Our company generally configures one active JobManager and two standby JobManagers, combined with ZooKeeper for leader election.
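
A hedged flink-conf.yaml sketch of that ZooKeeper-based HA setup (hostnames and paths are placeholders):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.cluster-id: /flink-cluster   # placeholder cluster id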

Origin blog.csdn.net/Poolweet_/article/details/108714102