The most complete Flink interview questions in history, a high-salary must-have from the big data interview collection

A few words up front

This article "Nin's Big Data Interview Collection" is a companion article to " Nin's Java Interview Collection ".

A special note: since the first release of the 41 topic PDFs of "Nin's Java Interview Collection", thousands of interview questions have been collected, and many readers have used them to join big tech companies and earn high salaries.

The collection of interview questions in "Nin's Java Interview Collection" has become a must-read book for Java learning and interviews.

Therefore, the Nien architecture team struck while the iron was hot and launched "Nin's Big Data Interview Collection", which has already released several topics:

" Nin's Big Data Interview Collection Topic 1: The Most Complete Hadoop Interview Questions in History "

" Nin's Big Data Interview Collection Topic 2: Top Secret 100 Spark Interview Questions, Memorized 100 Times, Get a High Salary "

" Nin's Big Data Interview Collection Topic 3: The Most Complete Hive Interview Questions in History, Continuously Iterating and Continuously Upgrading "

"Nin's Big Data Interview Collection Topic 4: The Most Complete Flink Interview Questions in History, Constantly Iterating and Continuously Upgrading" (this article)

"Nin's Big Data Interview Collection" will continue to be upgraded and iterated in the future, and it will become a must-read book for learning and interviewing in the field of big data , helping everyone grow into a three-in-one architect , enter a large factory, and get a high salary.

PDF of "Nin Architecture Notes", "Nin High Concurrency Trilogy" and " Nin Java Interview Collection ", please go to the official account [Technical Freedom Circle] to get it


About the authors

First author: Mark, senior big data architect and Java architect with nearly 20 years of experience in Java and big data architecture and development. Senior architecture mentor who has successfully guided many mid-level and senior Java engineers through the transition to architect roles.

Second author: Nien, a 41-year-old senior architect, senior writer, and well-known blogger in the IT field. Creator of "Java High Concurrency Core Programming, Enhanced Edition, Volumes 1-3" and author of 11 PDF "bibles", including the "K8S Study Bible", "Docker Study Bible", and "Go Study Bible". He is also a senior architecture instructor and architecture-transition mentor who has successfully guided many mid-level and senior Java engineers into architect roles; the highest annual salary among his students is nearly 1 million.

The most complete Flink interview questions in history

1. A brief introduction to Flink

Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. It provides core capabilities such as data distribution, fault tolerance, and resource management, and exposes several high-level APIs for writing distributed jobs:

  • DataSet API : performs batch operations on static data, abstracting it into distributed data sets. Users can apply the various operators Flink provides to process these data sets; Java, Scala, and Python are supported.
  • DataStream API : performs stream processing on data streams, abstracting streaming data into distributed data streams. Users can apply various operations to these streams; Java and Scala are supported.
  • Table API : performs query operations on structured data, abstracting it into relational tables that can be queried with an SQL-like DSL; Java and Scala are supported.

In addition, Flink provides libraries for specific application domains. For example, Flink ML, Flink's machine learning library, offers a machine learning Pipelines API and implementations of a variety of algorithms; Gelly, Flink's graph computing library, provides graph-processing APIs and implementations of various graph algorithms.

2. What are the main features of Flink?

Key features of Flink include:

  • Integration of stream processing and batch processing : Flink supports both stream processing and batch processing, and can seamlessly switch between stream processing and batch processing.
  • Event-driven processing model : Flink uses the concepts of event time and processing time, supports event-based processing and window operations, and is suitable for real-time data processing and analysis.
  • High performance and low latency : Flink's optimization engine can achieve high throughput and low latency data processing, which is suitable for application scenarios that require fast response.
  • Fault tolerance and reliability : Flink has a fault tolerance mechanism that can ensure the correctness and consistency of data processing when a node fails.
  • Flexible programming model : Flink supports multiple programming models, including stream-based API (DataStream API) and batch-based API (DataSet API), and provides multiple programming language interfaces.

3. What are the application scenarios of Flink?

Flink is suitable for the following application scenarios:

  1. Real-time data processing and analysis : Flink can process real-time data streams, support real-time data processing and analysis, and is suitable for scenarios such as real-time monitoring, real-time reporting, and real-time analysis.
  2. Batch processing tasks : Flink can process bounded data sets, support batch processing tasks, and is suitable for scenarios such as offline data processing and large-scale data analysis.
  3. Event-based applications : Flink's event-driven processing model is suitable for building event-based applications, such as real-time recommendation systems, fraud detection, and real-time prediction scenarios.
  4. Stream-batch integration application : Flink's stream-batch integration feature enables the combination of streaming and batch processing, which is suitable for application scenarios that require a combination of real-time and offline processing.
  5. Data mining and machine learning : Flink can handle large-scale data sets and supports various data mining and machine learning algorithms, suitable for building large-scale data mining and machine learning applications.
  6. Real-time calculation and decision-making : Flink supports real-time calculation and decision-making, and can make real-time decisions and actions based on real-time data streams. It is suitable for scenarios that require real-time decision-making and actions, such as real-time pricing and real-time advertising.
  7. IoT applications : Flink can handle large-scale real-time data streams, and is suitable for processing real-time data in IoT applications, such as smart homes, smart cities, and smart transportation.

Flink is a general big data processing framework, which can be applied to various large-scale data processing and analysis scenarios, especially for scenarios that require real-time processing and analysis.

4. What is the Flink programming model?

In one sentence: Source -> Transformation* -> Sink.

The Flink programming model is a programming model for processing streaming data, which includes three core concepts: Source, Transformation and Sink. The data flow starts from Source, goes through multiple Transformation operations, and finally reaches the end of Sink. In this process, data can be processed, filtered, transformed, aggregated, etc. to achieve real-time data processing and analysis.

Specifically, in the Flink programming model, developers first specify the data source (Source), which can be a file, a network data stream, a database, and so on. The data is then processed through a series of Transformation operations such as filtering, mapping, aggregation, and windowing; these transformations can be combined to implement complex processing and analysis. Finally, the processed data is sent to a Sink, the destination of the data, which again can be a file, a network data stream, a database, and so on.

The Flink programming model supports event time semantics, that is, data processing is sorted and processed according to the time when events occur. At the same time, Flink also supports functions such as window operation, state management, and event processing to achieve more complex data processing and analysis scenarios.
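
To make the Source -> Transformation -> Sink model concrete, here is a minimal, hedged sketch using the DataStream API in Java: it reads lines from a socket, splits them into words, keys by word, keeps a running count, and prints the result. The class name, host, and port are placeholders, not part of any official example.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SourceTransformSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: read text lines from a socket (host and port are placeholders)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Transformation: split each line into words and count occurrences per word
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>)
                        (String line, Collector<Tuple2<String, Integer>> out) -> {
                            for (String word : line.split("\\s+")) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .sum(1);

        // Sink: print the running counts to stdout
        counts.print();

        env.execute("source-transform-sink sketch");
    }
}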

5. Tell me about the operating architecture of Flink. What are the roles of the Flink cluster? What is the role of each?

At runtime, a Flink program mainly involves three roles: Client, JobManager, and TaskManager.

When a Flink cluster starts, a JobManager and one or more TaskManagers are launched first. The Client submits a job to the JobManager, which schedules the job's tasks onto the TaskManagers for execution; the TaskManagers then report heartbeats and statistics back to the JobManager. Data is exchanged between TaskManagers in the form of streams. All three roles run as independent JVM processes.

  • JobManager : acts as the Master of the cluster. It is the coordinator of the whole cluster, responsible for receiving Flink jobs, coordinating checkpoints, handling failover and recovery, and managing the TaskManager worker nodes in the cluster.
  • TaskManager : the Worker that actually performs the computation; the tasks of a Flink job execute on it. Each TaskManager manages the resources of its node, such as memory, disk, and network, and reports its status to the JobManager.
  • Client : the client that submits the Flink program. When a user submits a Flink program, a Client is created first. The Client preprocesses the program and submits it to the Flink cluster. To do so, it obtains the JobManager's address from the configuration of the submitted program, establishes a connection to the JobManager, and submits the Flink job to it.

6. What are the cluster deployment modes of Flink?

Flink's cluster deployment modes include:

  • Standalone mode : Flink runs its own JobManager and TaskManager processes without an external resource manager; suitable for test environments and simple setups.
  • Local mode : simulate a Flink cluster with multiple threads inside a single local JVM; suitable for development and debugging.
  • Separate (distributed) deployment : deploy the JobManager and TaskManagers on different machines, typically on a resource manager such as YARN or Kubernetes; suitable for production environments and large-scale jobs.
  • Embedded mode : integrate Flink into an existing application and use it as a library; suitable for scenarios where stream processing capabilities need to be embedded in another application.

7. How big was your Flink cluster before?

The cluster size is usually closely related to the company's business needs, data volume, computing resources and other factors.

In practical applications, the size of a Flink cluster may vary from a few to dozens or even hundreds of nodes. The size of the cluster depends on business needs and data processing capabilities. Large Internet companies may need to process more data and requests, so their Flink cluster size may be larger.

In terms of deployment, most companies use YARN. YARN provides distributed resource management, so resources in the cluster can be scheduled and managed more effectively. Depending on the company's needs and resource configuration, different YARN deployment modes can be chosen, such as per-job mode or session mode; the choice of deployment mode affects the size and performance of the cluster.

In short, the size of the Flink cluster will vary according to the company's actual needs and resource allocation, and needs to be evaluated and optimized according to the specific situation.

8. Talk about Flink cluster optimization

Flink cluster optimization is a key step to improve the performance of Flink cluster.

Here are some suggestions for Flink cluster optimization:

  1. taskmanager.heap.mb tuning : taskmanager.heap.mb is the size of the Flink task manager heap memory, the default is 1024MB. If higher memory is required, it can be adjusted to 2048MB or higher. This ensures that Task Manager has enough memory to process data and execute tasks.
  2. Adjust the parallelism of executing tasks : The parallelism of Flink tasks can be adjusted through task properties. Increasing the degree of parallelism can increase the execution speed of tasks, but it will also increase the usage of memory and CPU. Therefore, it is necessary to adjust the parallelism of the task according to the specific situation.
  3. Optimizing task scheduling : Flink task scheduling can be optimized in several ways. For example, the number of taskmanagers and the allocation strategy can be adjusted to ensure that tasks are distributed evenly across the different taskmanagers. Task priorities and resource requirements can also be adjusted to ensure tasks get the resources they need first.
  4. Optimize network configuration : The network configuration of the Flink cluster also has a great impact on performance. For example, the connection between taskmanagers can be adjusted to ensure that task data can be transferred quickly. Network bandwidth and latency can also be adjusted to ensure tasks are completed within the specified time.
  5. Optimizing state management : The state management of Flink tasks is also an important optimization aspect. For example, Flink's state backup and restore features can be used to ensure that the task state can be synchronized between different nodes in the cluster. You can also adjust how and where state is persisted to ensure state data is not lost.
  6. Use Flink's advanced optimization features : Flink also provides many advanced optimization features, such as code generation, optimizers, and iteration operators. These functions can significantly improve the performance of the Flink cluster, but need to be adjusted and used according to the specific situation.

To sum up, Flink cluster optimization requires comprehensive consideration of multiple aspects, including memory management, task scheduling, network configuration, state management, and advanced optimization functions. By adjusting these parameters and configurations, the performance and efficiency of the Flink cluster can be significantly improved.

9. How does the company submit real-time tasks and how many Job Managers are there?

1) We use the yarn session mode to submit tasks; another way is to create a new Flink cluster for each submission to provide resources for each job. The tasks are independent of each other and do not affect each other, which is convenient for management. The cluster created after the task execution is completed will also disappear. The online command script is as follows:

bin/yarn-session.sh -n 7 -s 8 -jm 3072 -tm 32768 -qu root.*.* -nm *-* -d

This requests 7 TaskManagers (-n 7), each with 8 slots (-s 8), 3072 MB of memory for the JobManager (-jm 3072), and 32768 MB of memory for each TaskManager (-tm 32768).

2) The cluster has only one JobManager by default, but to avoid a single point of failure we configured high availability. In standalone mode, our company typically configures one primary JobManager and two standby JobManagers, using ZooKeeper to achieve high availability. In YARN mode, YARN automatically restarts the JobManager when it fails, so only one is needed; we configured a maximum of 10 restart attempts.

10. Do you understand the parallelism of Flink? What is the parallelism setting of Flink?

A Flink program consists of multiple tasks (Source, Transformation, Sink). Tasks are divided into multiple parallel instances for execution, and each parallel instance processes a subset of the task's input data. The number of parallel instances of a task is called the degree of parallelism.

In the actual production environment, we can set the degree of parallelism from four different levels:

  • Operator Level: set on an individual operator, e.g. operator.setParallelism(3)
  • Execution Environment Level: set when building the execution environment, e.g. env.setParallelism(1)
  • Client Level: set with the -p option when submitting, e.g. flink run -p 4
  • System Level: set via parallelism.default in the Flink configuration file (flink-conf.yaml)

Priorities that need attention : operator level > environment level > client level > system level (in actual business, the parallelism is usually set to be the same as the number of Kafka partitions or a multiple of Kafka partitions).

Flink can set several levels of parallelism, including Operator Level, ExecutionEnvironment Level, Client Level, System Level

Specify the system-level default parallelism for all execution environments through the parallelism.default configuration item in flink-conf.yaml;

In the ExecutionEnvironment, you can set the default parallelism for operators, data sources, and data sinks through setParallelism;

If operators, data sources, or data sinks have their own parallelism set, it overrides the parallelism set on the ExecutionEnvironment.
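
As a small illustration of the first two levels, the sketch below (an assumed example, not from the original article) sets a job-wide default parallelism on the execution environment and then overrides it on a single operator. The client level corresponds to the -p option of flink run, and the system level to parallelism.default in flink-conf.yaml.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Execution environment level: default parallelism for every operator in this job
        env.setParallelism(4);

        env.fromElements("a", "b", "c", "a")
                .map(String::toUpperCase)
                .setParallelism(2)   // operator level: overrides the environment default for this map
                .print();

        env.execute("parallelism sketch");
    }
}

If this job were submitted with, say, bin/flink run -p 8 ..., the -p value would apply only to operators with no explicit setting, and parallelism.default would be the final fallback.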

11. Where does Flink's Checkpoint exist?

Flink's Checkpoint is one of the core components of Flink. It is used to record the state of the application at a specific moment so that it can be recovered when the application fails. Checkpoint is usually stored in Flink's storage system, which can be memory, file system or RocksDB.

1. Memory

Flink's in-memory state is stored in Java memory. While the application is running, Flink stores state data in memory and periodically persists these state data to an external storage system. If the application fails at runtime, Flink can restore the application's state from the in-memory state.

2. File system

Flink can also store state data in the file system. When an application is running, Flink writes state data to a distributed file system such as HDFS or NFS. If the application fails at runtime, Flink can restore the state of the application from the file system.

3. RocksDB

Flink can also store state data in RocksDB. RocksDB is a high-performance, high-reliability key-value store database that supports efficient data compression and fast lookups. When the application is running, Flink will write the state data to the RocksDB database, and periodically persist the state data to the external storage system. If the application fails at runtime, Flink can restore the state of the application from RocksDB.

In short, Flink's checkpoints can be stored in memory, in a file system, or in RocksDB, and the concrete location is determined by user configuration. Flink exposes APIs for controlling checkpointing, such as StreamExecutionEnvironment.enableCheckpointing() and the CheckpointConfig object; with them users can enable periodic checkpoints, and the job's state is restored automatically from the latest completed checkpoint when the application fails.
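
A minimal configuration sketch, assuming Flink 1.13 or later (where the state backend and the checkpoint storage location are configured separately); the checkpoint interval and the HDFS path are placeholders:

import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds with exactly-once semantics
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep working state on the JVM heap (swap in EmbeddedRocksDBStateBackend for very large state)
        env.setStateBackend(new HashMapStateBackend());

        // Persist completed checkpoints to a durable file system (path is a placeholder)
        env.getCheckpointConfig().setCheckpointStorage("hdfs://namenode:8020/flink/checkpoints");
    }
}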

12. What are the differences and advantages of Flink's checkpoint mechanism compared with spark?

Both Flink and Spark are mainstream big data processing frameworks, and they both support the Checkpoint mechanism to ensure the reliability and fault tolerance of real-time data. However, the Checkpoint mechanism of Flink and Spark has some differences in implementation and function.

1. Implementation method

Flink's Checkpoint mechanism adopts lightweight distributed snapshot technology to realize the snapshot of each operator and the snapshot of the flowing data. This snapshot technology can quickly save and restore state data, thereby reducing the time for failure recovery. However, Spark's checkpoint mechanism mainly performs checkpoints of data and metadata for driver failure recovery, and does not implement snapshots of operators.

2. Failure recovery

Flink's Checkpoint mechanism can support failure recovery of any node, including operators and drivers. When a node fails, Flink will automatically switch to other available nodes and restore state data from the latest Checkpoint. However, Spark's checkpoint mechanism can only recover the failure of the driver, and the entire application needs to be restarted for the failure of the operator.

3. Data Consistency

Flink's Checkpoint mechanism can guarantee data consistency, that is, all operators under the same Checkpoint are in the same state. This is because Flink uses distributed snapshot technology to ensure that each operator saves the same state data. However, Spark's checkpoint mechanism cannot guarantee data consistency, because in Spark, each operator may store different state data.

4. Performance Impact

Flink's Checkpoint mechanism uses lightweight distributed snapshot technology, so its performance impact is relatively small. Spark's Checkpoint mechanism needs to save the state data of the entire application to an external storage system, so its performance impact is relatively large.

In general, Flink's Checkpoint mechanism is more complex and powerful than Spark's Checkpoint mechanism, which can support failure recovery of any node and ensure data consistency. In addition, Flink's Checkpoint mechanism uses lightweight distributed snapshot technology, so its performance impact is relatively small. These advantages make Flink have better reliability and fault tolerance in real-time data processing.

13. What are the commonly used operators in Flink?

Flink is a stream processing framework that provides a wealth of operators for data processing and transformation. Here are some common operators:

  1. Map operator : maps each element in a data stream to another element. The Map operator is one of the most basic operators in Flink. It accepts a mapping function as a parameter, which maps input data to output data.
  2. Filter operator : keeps or discards each element in the data stream according to a predicate. The Filter operator filters out elements that do not satisfy the specified condition and only emits elements that do.
  3. KeyBy operator : group the data stream according to the specified key. The KeyBy operator groups the elements in the data stream according to the specified key, and aggregates the elements in each group together.
  4. Window operator : perform window operations on data streams. The Window operator can specify parameters such as the type, size, and sliding mode of the window, and perform window operations on the data stream, such as rolling windows, sliding windows, and session windows.
  5. Reduce operator : performs a reduction operation on the elements in the data stream, and merges multiple elements into one element. The Reduce operator accepts an aggregation function as an argument, which aggregates input data into output data.
  6. Aggregate operator : aggregates elements in the data stream. The Aggregate operator is similar to the Reduce operator, but it can specify multiple aggregation functions, and supports both local aggregation and global aggregation.
  7. Join operator : connect elements in the data stream. The Join operator can specify parameters such as the connection method, the connection key, and the connection condition to connect two data streams together.

In addition to the above operators, Flink also provides many other operators, such as Union, HashJoin, Sort, Limit, etc., to achieve more complex data processing and analysis scenarios.
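
The following sketch chains several of the operators above (map, filter, keyBy, window, reduce) on a toy bounded source, purely for illustration; the input format and window size are assumptions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OperatorChainSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("click:3", "view:1", "click:5", "view:2")
                // Map: parse "event:count" into (event, count)
                .map(s -> Tuple2.of(s.split(":")[0], Integer.parseInt(s.split(":")[1])))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // Filter: keep only positive counts
                .filter(t -> t.f1 > 0)
                // KeyBy: group by event type
                .keyBy(t -> t.f0)
                // Window: 10-second tumbling processing-time windows
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                // Reduce: sum counts within each key and window
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                .print();

        env.execute("operator chain sketch");
    }
}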

14. How does Flink's stream processing deal with delay?

Flink's streaming can handle latency in the following ways:

Event time processing: Flink supports event time processing, which can handle out-of-order events, sort and process data according to event time, so as to solve the delay problem.

Window operation: Flink's window operation can divide and process the data stream according to the event time or processing time, and the window size and sliding interval can be set as needed to control the delayed processing.

15. What kinds of window support does Flink include? Talk about their usage scenarios

Flink divides windows in two basic ways: by time and by count; session windows are a special kind of time window.

  • Tumbling Time Window (rolling time window) : rolls over once a fixed period of time has elapsed; windows do not overlap and every event belongs to exactly one window, which in effect is a form of micro-batching. It is used for time-series data in real-time streams, such as stock price trends or real-time traffic monitoring. For example, to count the total number of products users purchase every minute, the user behavior events are cut into one-minute slices; this kind of slicing is a tumbling time window.
  • Sliding Time Window (sliding time window) : advances by a slide interval once a fixed period of time has elapsed, so windows overlap and a single element can belong to several windows. It is used for "recent period" analysis of time-series data, for example computing, every 30 seconds, the total number of items a user bought in the last minute.
  • Tumbling Count Window (rolling count window) : fires once a fixed number of elements has been collected; windows do not overlap. It is used for count-based statistics, such as counting website visits or analyzing user purchase behavior. For example, to compute the total purchases for every 100 user purchase events, the window is evaluated each time it fills up with 100 elements.
  • Sliding Count Window (sliding count window) : fires every time a fixed number of new elements has arrived, with overlapping windows. It is used for real-time statistics such as monitoring ad click-through rates or live vote counts.
  • Session Window (session window) : has no fixed size; windows are split by a user-defined inactivity gap and do not overlap. It is used for user-interaction event streams, for example computing the total number of items each user buys during an active period: if a user is inactive for 30 seconds, the session is considered closed (assuming the raw stream contains a single user's purchase events). In general, a window defines a finite set of elements over an infinite stream; the set can be based on time, element count, a combination of time and count, session gaps, or custom logic. Flink's DataStream API provides concise operators for common window operations, as well as a general windowing mechanism that lets users define their own window assignment logic. (Declarations for each of these window types are sketched below.)
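
A hedged sketch of how each window type is declared on a keyed stream; the keys, values, and window sizes are placeholders, and no job is actually executed here:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowTypesSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> keyed = env
                .fromElements(Tuple2.of("user-1", 1), Tuple2.of("user-2", 3))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0);

        // Tumbling time window: non-overlapping 1-minute windows
        keyed.window(TumblingEventTimeWindows.of(Time.minutes(1))).sum(1);

        // Sliding time window: 1-minute windows that slide every 30 seconds (overlapping)
        keyed.window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(30))).sum(1);

        // Tumbling count window: fires every 100 elements per key
        keyed.countWindow(100).sum(1);

        // Sliding count window: a window of 100 elements that advances every 10 elements
        keyed.countWindow(100, 10).sum(1);

        // Session window: 30 seconds of inactivity closes the session
        keyed.window(EventTimeSessionWindows.withGap(Time.seconds(30))).sum(1);

        // No execute() call: this sketch only shows the window assigner API
    }
}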

16. Which third-party integrations does Flink support?

Flink supports integration with a variety of third-party tools and frameworks, including:

  • Apache Kafka : Flink can be seamlessly integrated with Kafka as a data source and data sink (a source-connector sketch follows this list).
  • Apache Hadoop : Flink can be integrated with Hadoop, can read data in the Hadoop file system, and can also write processing results to the Hadoop file system.
  • Apache Hive : Flink can be integrated with Hive, and can read data in Hive tables for processing and analysis.
  • Apache HBase : Flink can be integrated with HBase and can read and write data in HBase.
  • Elasticsearch : Flink can be integrated with Elasticsearch, and the processing results can be written to Elasticsearch for real-time search and analysis.
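
As an example of the Kafka integration, here is a minimal source sketch, assuming the flink-connector-kafka dependency and Flink 1.14+ (which provide the KafkaSource builder); the broker address, topic, and group id are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaIntegrationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka source (broker address, topic, and group id are placeholders)
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("flink-demo")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .print();

        env.execute("kafka integration sketch");
    }
}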

17. What are the data sources and data receivers of Flink?

Flink supports a variety of data sources and data sinks, including:

  • Data source : Data can be read from data sources such as file systems, Kafka, and message queues, and converted into data streams for processing.
  • Data receiver : The processing results can be output to data receivers such as file systems, databases, and Kafka, or sent to downstream processing links.

Flink has some basic data sources and sinks built in that are always available. The predefined data sources include files, directories, and sockets, and data can also be loaded from collections and iterators. The predefined data sinks support writing to files, to standard output and standard error, and to sockets.

18. What batch operations does Flink support?

Flink supports a variety of batch operations, including:

  • Map : Applies the specified function to each element in the dataset.
  • Reduce : Perform a reduction operation on the data set to reduce the data to a result.
  • Filter : Filters the elements in the dataset according to the specified criteria.
  • Join : Join two data sets according to the specified key.
  • GroupBy : Group the data set according to the specified key.

19. How to switch between Flink's stream processing and batch processing?

Flink can seamlessly switch between stream processing and batch processing, mainly thanks to its event time-based window processing mechanism and flexible job scheduling strategy. Flink provides two types of jobs: batch jobs and stream jobs.

1. Batch job :

Batch jobs process data as bounded datasets, similar to traditional batch jobs. In batch mode, Flink divides the data into batches, and then processes each batch offline. Batch jobs are usually used in scenarios such as processing historical data or periodically generating statistical reports.

To run a batch job, users place the input data as files in a file system accessible to Flink (such as HDFS or a local file system) and then process it with a Flink job. In a batch job, the user can specify a data cut-off (data before the deadline is processed) as well as the job's parallelism and other parameters.

2. Stream processing jobs :

Stream processing jobs process data as unbounded streams, process data in real-time and produce real-time results. In stream processing mode, Flink will receive data in real time and assign it to different tasks for processing. Stream processing jobs are usually used in scenarios such as real-time data processing, real-time analysis, and real-time monitoring.

To run stream processing jobs, users need to configure data sources (such as Kafka, Flume, etc.) and Flink clusters, and then process them through Flink jobs. In stream processing jobs, users can specify data processing time windows, triggers and other parameters to meet real-time data processing requirements.

In Flink, the same program can be switched between batch and streaming execution via the execution.runtime-mode setting (STREAMING, BATCH, or AUTOMATIC), which can be specified in the configuration, on the command line when submitting the job, or programmatically in the code. In addition, users can view and manage job status through the Flink Web UI to ensure that the job runs correctly.
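
A minimal sketch of switching the runtime mode programmatically; the input path is a placeholder, and the same effect can be achieved from the command line with -Dexecution.runtime-mode=BATCH when submitting the job:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Run the same DataStream program as a bounded batch job;
        // use STREAMING (the default) or AUTOMATIC for unbounded sources.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.readTextFile("file:///tmp/input.txt")   // placeholder path; a bounded source
                .map(String::trim)
                .print();

        env.execute("runtime mode sketch");
    }
}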

20. Why use Flink instead of Spark?

The advantages of Flink over Spark are mainly reflected in the following aspects:

  1. Low latency and high throughput : Flink is an event-driven stream computing framework that supports low-latency, high-throughput data processing. Its low latency comes from processing records one event at a time rather than in micro-batches, which makes millisecond-level latency possible. At the same time, high throughput is one of Flink's strengths: it can process tens of millions of records per second.
  2. Better support for streaming data application scenarios : Flink focuses on streaming data processing and can better support streaming data application scenarios, such as real-time computing, real-time monitoring, and real-time recommendation. And Spark is more suitable for batch data processing, such as offline analysis, batch reporting, etc.
  3. Ability to handle out-of-order data : Flink can handle out-of-order data very well, and can automatically handle the problem of inconsistent data order during data processing. However, Spark requires additional configuration and processing when processing out-of-order data.
  4. Ensure exactly-once state consistency : Flink can guarantee exactly-once state consistency, that is, each event will be processed once and only once. However, Spark has the problem of repeated processing when processing data, and additional optimization and configuration are required to ensure state consistency.

To sum up, Flink has advantages over Spark in terms of low latency, high throughput, support for streaming application scenarios, handling of out-of-order data, and exactly-once state consistency, and is therefore favored by more and more companies and developers.

21. Talk about Flink's fault tolerance mechanism. How does Flink achieve fault tolerance?

Flink is a distributed stream processing framework that implements a fault-tolerant mechanism to ensure that data will not be lost and can be recovered when a node fails. Flink's fault tolerance mechanism mainly relies on two powerful mechanisms: Checkpoint and State.

  • Checkpoint : It is a snapshot mechanism that is used to periodically back up the state in the Flink program and store it in an external storage system. When a node fails, Flink can use Checkpoint to restore the state of the program and continue processing the data stream from the point of failure. Checkpoint backup can be full or incremental, depending on the triggering conditions and backup strategy of Checkpoint. Flink also supports Exactly-Once semantics, which means that when recovering from a failure, Flink can ensure that each event is processed once and only once.
  • State : It is another important mechanism in Flink, which is used to store the intermediate state in the calculation process. State can be divided into two types: Operator State and Keyed State. Operator State is an operator-based state, which is stored inside the operator and updated as the operator executes. Keyed State is a key-based state that is stored inside a Stateful Function and uses keys to identify the state's data. Keyed State can have an expiration time (TTL), which enables Flink to automatically clean up expired state data when the state expires.

In Flink, Checkpoint and State are interdependent. Checkpoint is used to back up State and ensure that the state of the program can be restored when a node fails. The State is used to store the intermediate state in the calculation process and supports Exactly-Once semantics. Through the combination of these two mechanisms, Flink has achieved strong fault tolerance and fault recovery capabilities, making Flink highly reliable and available in distributed stream processing.
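
As an illustration of Keyed State with a TTL, here is a hedged sketch of a rich function that counts events per key and lets the count expire one hour after the last update; the class name and TTL settings are assumptions:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;

// Counts events per key using Keyed State; entries expire one hour after the last update.
public class CountWithTtl extends RichMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.hours(1))                                   // expire 1 hour after last write
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("count", Long.class);
        desc.enableTimeToLive(ttl);
        count = getRuntimeContext().getState(desc);
    }

    @Override
    public Long map(String value) throws Exception {
        Long current = count.value();
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        return next;
    }
}

It would be applied after a keyBy, for example stream.keyBy(s -> s).map(new CountWithTtl()).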

22. How does Flink achieve efficient network data exchange?

Flink achieves high efficiency in network data exchange, mainly due to the following aspects:

  1. Distributed data exchange : Flink uses a distributed computing model based on JobGraph, and data can interact in different tasks. This distributed data exchange enables Flink to make full use of multiple nodes in the cluster to process large-scale data streams, thereby improving the parallelism and throughput of the entire system.
  2. TaskManager is responsible for data interaction : In Flink, TaskManager is responsible for managing Task execution and data interaction. TaskManager will collect Records from the buffer (Buffer), and then send it to other Tasks. This centralized data management method can reduce the number of network connections, thereby improving network throughput.
  3. Batch encapsulation : The batch (Batching) mechanism in Flink can encapsulate multiple Records together to form a batch (Batch). Batch encapsulation can greatly reduce the number of network connections, because network I/O is a scarce resource in distributed scenarios. Reducing the number of network connections can improve system throughput and concurrency. In fact, in the analysis of Kafka source code, we can also see that Kafka uses a similar record encapsulation mechanism to improve throughput.
  4. Network congestion control : Flink also uses a congestion control mechanism during network data exchange to avoid network overload. When the network bandwidth of a certain node is too high, Flink will reduce the data output rate of the node to alleviate network congestion, thereby ensuring the stable operation of the entire system.
  5. Adaptive network topology : Flink supports adaptive network topology, which can dynamically adjust the routing strategy for data exchange according to the number and location of nodes in the cluster. This adaptive network topology can improve the performance and reliability of the system because it can better utilize the network resources in the cluster.

To sum up, Flink achieves high efficiency in network data exchange, mainly through mechanisms such as distributed data exchange, TaskManager responsible for data exchange, batch encapsulation, network congestion control, and adaptive network topology. These mechanisms enable Flink to have high throughput, high concurrency, and high reliability when processing large-scale data streams.

23. How does the Flink program deal with the peak period of data?

When a Flink program faces a data peak, a common approach is to put a high-capacity Kafka in front as the data source: data is first written to the message queue and then consumed by Flink. This "peak shaving and valley filling" effectively cushions the impact of traffic spikes on the Flink program, improving its stability and reliability.

However, using Kafka as a data source will affect a little real-time performance. Because Kafka is an asynchronous message queue, data needs to wait for consumers to consume in the queue, so there will be a certain delay. In order to solve this problem, the following methods can be adopted:

  1. Adjust the parameters of Kafka, such as increasing the cache size of Kafka, increasing the number of concurrent consumers of Kafka, etc., to improve the throughput and processing capacity of Kafka.
  2. Optimize the configuration of Flink programs, such as increasing the parallelism of Flink, adjusting the memory configuration of Flink, etc., to improve the processing power and throughput of Flink.
  3. Use Stateful Functions or Checkpointing functions in Flink to maintain data consistency and reliability. Stateful Functions can make Flink programs state-aware of data processing, so as to better handle events in data streams. The Checkpointing function allows the Flink program to periodically persist the intermediate state to the external storage system when processing data, so as to recover when the program fails.

To sum up, using Kafka as a data source can effectively handle data peak periods, but you need to pay attention to the configuration optimization of Kafka and Flink, as well as the real-time and consistency issues of data processing.

24. What is the principle of Flink distributed snapshot?

The core part of Flink's fault tolerance mechanism is to make consistent snapshots of distributed data streams and operator states. These snapshots act as consistent checkpoints, and the system can be rolled back in the event of a failure. The mechanism Flink uses to make these snapshots is described in "Lightweight Asynchronous Snapshots of Distributed Data Streams". It is inspired by the standard Chandy-Lamport algorithm for distributed snapshots and tailored specifically for Flink's execution model.

Barriers are injected into the parallel data streams at the sources. The position at which the barriers for snapshot n are injected (call it Sn) is the position in the source stream up to which data is included in snapshot n.

For example, in Apache Kafka this position would be the offset of the last record in the partition. The position Sn is reported to the checkpoint coordinator (Flink's JobManager).

The barriers then flow downstream. When an intermediate operator receives barriers for snapshot n from all its input streams, it emits barriers for snapshot n into all its output streams.

Once a sink operator (the end of a streaming DAG) has received barriers n from all its input streams, it acknowledges to the checkpoint coordinator that snapshot n is complete.

After all sinks confirm the snapshot, it means that the snapshot is complete. Once the snapshot n is completed, the job will never ask the data source for records before Sn, because at this time these records (and their subsequent records) will have passed through the entire data flow topology, that is, they have been processed.

25. The difference between Flink and Spark Streaming

This question is a very macro question, because there are so many differences between the two frameworks. But there is a very important point that must be answered during the interview: Flink is a standard real-time processing engine based on event-driven. Spark Streaming is a Micro-Batch model.

Below we introduce the main differences between the two frameworks in several aspects:

  1. Architecture model The main roles of Spark Streaming at runtime include: Master, Worker, Driver, and Executor. Flink mainly includes: Jobmanager, Taskmanager, and Slot at runtime.
  2. Task scheduling Spark Streaming continuously generates small batches of data to build a directed acyclic graph DAG, and Spark Streaming will create DStreamGraph, JobGenerator, and JobScheduler in turn. Flink generates a StreamGraph based on the code submitted by the user, optimizes it to generate a JobGraph, and then submits it to the JobManager for processing. The JobManager generates an ExecutionGraph based on the JobGraph. The ExecutionGraph is the core data structure for Flink scheduling. The JobManager schedules jobs based on the ExecutionGraph.
  3. Time Mechanism Spark Streaming supports a limited time mechanism, only processing time. Flink supports three definitions of time for stream processing programs: processing time, event time, and injection time. At the same time, it also supports the watermark mechanism to deal with lagging data.
  4. Fault Tolerance Mechanism : For Spark Streaming tasks we can set a checkpoint, and if a failure occurs the job can restart and recover from the last checkpoint; however, this only prevents data loss, and records may be processed more than once, so exactly-once semantics cannot be achieved. Flink solves this problem with checkpoints combined with a two-phase commit protocol.

26. Talk about several time semantics of Flink

Flink supports three time semantics: Event Time, Ingestion Time and Processing Time.

1. Event Time

Event Time is the time at which an event was created, usually described by a timestamp inside the event and typically produced by the event generator or a sensor. In Flink, event time is handled with watermarks and timers. For example, when collecting log data, each log record carries its own generation time, and Flink accesses the event timestamp through a timestamp assigner. Because Event Time is the time the event occurred and is independent of when the data is processed, it reflects when events really happened, but not the delay or asynchrony of the processing itself.

2. Ingestion Time

Ingestion Time is the time at which data enters Flink, i.e. when a record is received at the source operator; it is independent of when the event was created. Ingestion Time reflects the delay and asynchrony of data ingestion, but it cannot reflect when the event really happened.

3. Processing Time

Processing Time is the local system time of each operator that performs time-based operations, and is related to the machine. It refers to the time when the operator processes the data, and has nothing to do with the time when the event is created and the time when the data enters Flink. Processing Time is the default time attribute, unless you explicitly specify the time semantics as Event Time or Ingestion Time.

In practical applications, choosing the appropriate time semantics can affect the correctness and efficiency of the data stream processed by Flink.

For example, if you need to process real-time data streams, it is more appropriate to choose Event Time;

If you need to deal with delayed data flow, it is more appropriate to choose Ingestion Time;

If you need to process offline data sets, it is more appropriate to choose Processing Time.

At the same time, Flink also provides a WaterMark mechanism to process delayed data and asynchronous data to ensure the correctness and reliability of data processing.

27. Talk about the Watermark mechanism in Flink

The Watermark mechanism in Flink is a mechanism to measure the progress of Event Time, which can be used to deal with out-of-order events. During data stream processing, data may arrive out of order due to various factors such as network delay and back pressure. In order to correctly handle these out-of-order events, Flink introduces the Watermark mechanism, which is implemented in combination with Window.

A Watermark is a timestamp that asserts that all data with an event time less than or equal to that timestamp should already have arrived. In Flink, each operator maintains a current watermark; events that arrive with a timestamp at or below the current watermark are considered late. Window execution is triggered by watermarks: when the watermark passes the end time of a window, the window fires and its computation logic is executed.

In order to realize the correct processing of the window, Flink also introduces the concept of event time (Event Time), each event will carry a timestamp, indicating the time when the event was generated. During data stream processing, Flink processes events in the order of their timestamps, which ensures the correct order of events. However, due to network delays, back pressure, and other reasons, events may arrive out of order, which requires the use of the Watermark mechanism to handle these out of order events.

To sum up, the Watermark mechanism in Flink is a mechanism for dealing with out-of-order events. It can set a delayed trigger to indicate that all data with an event time less than or equal to the timestamp has arrived. By combining the window mechanism, the Watermark mechanism can realize the correct processing of out-of-order events and ensure the correctness and integrity of the data flow.
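
A minimal sketch of assigning event-time timestamps and a bounded-out-of-orderness watermark before an event-time window; the 5-second tolerance, the toy elements, and the window size are assumptions:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Elements are (key, eventTimestampMillis); a toy bounded source for illustration
        env.fromElements(Tuple2.of("a", 1_000L), Tuple2.of("a", 4_000L), Tuple2.of("a", 2_000L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // Watermark = max event time seen so far minus 5 seconds of tolerated out-of-orderness
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((e, ts) -> e.f1))
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                .print();

        env.execute("watermark sketch");
    }
}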

28. How does Flink do stress testing and monitoring?

If the speed of the generated data flow is too fast and the downstream operators cannot consume it, back pressure will be generated. For backpressure monitoring, you can use the Flink Web UI to visually monitor Metrics, and you will know once an alarm is issued. Under normal circumstances, it may be because the sink operator has not been optimized well, and it is enough to do some optimization.

Set the maximum out-of-orderness (watermark delay) parameter carefully: if it is too large it can cause memory pressure. You can keep the maximum delay small, route late elements to a side output stream, and update the results later.
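
A hedged sketch of that pattern: a small out-of-orderness tolerance, a window with bounded allowed lateness, and a side output that catches anything later; the timestamps, window size, and lateness values are placeholders:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Tag that identifies the side output carrying late elements
        OutputTag<Tuple2<String, Long>> lateTag =
                new OutputTag<Tuple2<String, Long>>("late-events") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> result = env
                .fromElements(Tuple2.of("a", 1_000L), Tuple2.of("a", 120_000L), Tuple2.of("a", 2_000L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                                .withTimestampAssigner((e, ts) -> e.f1))
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .allowedLateness(Time.seconds(30))   // keep window state 30s after the watermark passes
                .sideOutputLateData(lateTag)         // anything later still goes to the side output
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

        result.print();                               // on-time (and allowed-late) results
        result.getSideOutput(lateTag).print();        // late elements, handled separately

        env.execute("late data sketch");
    }
}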

In addition, if a sliding window is very long but its slide interval is very short, Flink's performance drops sharply. By splitting windows into panes, each element is stored in only one pane instead of in every overlapping window, which reduces state writes during window processing.

29. What mechanism is the back pressure mechanism implemented by Flink?

At runtime, Flink is mainly composed of operators and streams. Each operator consumes an intermediate stream, applies a transformation to it, and produces a new stream. A vivid analogy for Flink's network mechanism is an efficient, bounded, distributed blocking queue, just like Java's BlockingQueue: with a blocking queue, a slower receiver slows down the sender, because once the (bounded) queue is full the sender blocks.

In Flink, these distributed blocking queues are these logical flows, and the queue capacity is realized through the buffer pool (LocalBufferPool). Each stream that is produced and consumed is assigned a buffer pool. The buffer pool manages a set of buffers (Buffer), which can be recycled after being consumed.

30. How does Flink deal with back pressure? How to monitor and discover?

Flink backpressure (Backpressure) means that when an operator produces data faster than its downstream operator can consume it, data accumulates at the downstream operator, slowing processing down or even blocking it. To deal with this, Flink introduces a backpressure mechanism that detects and resolves such processing-speed mismatches in time.

Flink does not use complex mechanisms when dealing with backpressure, but adopts a simple and efficient method. Flink uses a distributed blocking queue during data transmission, which effectively solves the backpressure problem.

In Flink, when the output speed of an operator is faster than that of the downstream operator, Flink will use the distributed blocking queue to cache the output data. This can prevent upstream operators from generating data too quickly, causing downstream operators to fail to process it in time, thus forming back pressure. When a downstream operator needs data, it takes the data from the queue and processes it. When the data in the queue reaches a certain threshold, the upstream operator will be notified, thereby slowing down the data generation speed. In this way, Flink achieves backpressure relief through distributed blocking queues.

In addition, Flink achieves backpressure relief through communication between each TaskManager and JobManager. When downstream processing tasks take too long, Flink detects this and considers this a signal of backpressure. At this point, Flink will pass this backpressure signal to the manager of the upstream task.

Specifically, Flink's backpressure strategy is mainly divided into the following steps:

  1. Task backpressure : Flink detects when downstream tasks are processing slowly and considers this a signal of backpressure. At this point, Flink will pass this backpressure signal to the manager of the upstream task.
  2. Adjust data generation speed : When the upstream task manager receives the backpressure signal, it will adjust the data generation speed according to the strength of the backpressure signal. Typically, the stronger the backpressure signal, the less data generated by upstream tasks to ease the burden on downstream tasks.
  3. Control backpressure : Flink also uses some control mechanisms to avoid excessive backpressure. For example, when the data generation rate of upstream tasks is too slow, Flink will limit the intensity of backpressure to avoid excessive data backlog. In addition, Flink will also set a back pressure threshold. When the back pressure signal exceeds this threshold, Flink will consider that the task is in an unstable state, and will take corresponding measures, such as adjusting the parallelism of the task, suspending the task, and so on.
  4. Restore data generation speed : When the processing speed of downstream tasks returns to normal levels, Flink will detect this change and gradually increase the data generation speed of upstream tasks to restore data flow.

The data generation speed of upstream tasks can be dynamically adjusted according to the processing speed of downstream tasks to alleviate the problem of data backlog. This strategy can improve the processing efficiency and stability of Flink tasks in practical applications.

Flink's backpressure monitoring and discovery is mainly carried out in the following ways:

  1. Flink Web UI : Flink Web UI is a web-based user interface for managing and monitoring a Flink cluster. In the Flink Web UI, users can view the running status of jobs, task management information, and backpressure status. Specifically, on the "Jobs" page, users can view the Backpressure status of each job, including OK, LOW and HIGH. In addition, in the "Task Managers" page, users can also view the heartbeat information and backpressure status of each TaskManager.
  2. Flink command line tools : In addition to the Web UI, users can also use the command line tools provided by Flink (such as "flink" and "jobmanager") for backpressure monitoring. For example, using the "jobmanager" command, a user can view detailed information about a job, including task status and backpressure status.
  3. Third-party monitoring tools : In addition to Flink's own monitoring tools, there are also some third-party Flink monitoring tools that can help users monitor backpressure status. For example, Apache Kafka provides a tool called "Kafka Console Consumer" for viewing consumption of Kafka topics. Through this tool, users can understand the speed of data production to determine whether there is a backpressure problem.
  4. Custom monitoring and alarm : In order to monitor the back pressure state more real-time and accurately, users can write custom monitoring and alarm scripts. These scripts can periodically obtain the status information of the Flink cluster and send alarm notifications according to preset rules. For example, when it is found that the backpressure state of an Operator is HIGH, an alarm email can be automatically sent to relevant personnel.

In short, Flink helps users monitor and discover backpressure problems in real time through various methods such as Web UI, command line tools, third-party monitoring tools, and custom monitoring and alarming, so as to ensure efficient and stable data processing.

31. The Window in Flink has data skew, what solution do you have?

Window operation in Flink is a data processing method based on time window, which can be used in application scenarios such as statistical analysis, monitoring, and real-time computing. However, when the amount of data is too large or the speed of data transmission is uneven, the amount of data accumulated in the window may vary too much, that is, data skew occurs.

Data skew will have a negative impact on Flink's performance, because window calculations require aggregation operations on all data, and data skew will cause the amount of data in some windows to be too large, thereby increasing calculation time and resource consumption. In order to solve the problem of data skew, the following methods can be adopted:

  1. Do pre-aggregation before the data enters the window : This method can perform certain aggregation operations before the data enters the window, so that the amount of data in each window is relatively uniform. The specific approach can be to perform pre-aggregation at the data source, or use the window aggregation functions (such as Tumbling Windows and Sliding Windows) in the DataStream API in Flink to perform pre-aggregation.
  2. Key redesign of window aggregation : In some cases, the key of window aggregation may need to be redesigned to avoid data skew. For example, the key can be designed as the timestamp when the data is sent, rather than some attribute of the data itself. This can make the amount of data in the window more uniform, thereby avoiding data skew.
  3. Adjust window parameters : In some cases, data skew can be avoided by adjusting window parameters. For example, the size of the window may be increased or the sliding interval of the window may be increased to make the amount of data in the window more uniform.
  4. Use Flink's Ttl operation : The Ttl operation in Flink can perform elimination operations based on the timestamp of the data when the data reaches the window, thereby avoiding data skew. The specific method is to set a Ttl time. When the data arrives in the window, if the timestamp of the data has exceeded the Ttl time, the data will be eliminated to avoid data skew.

To sum up, to solve the problem of window data skew in Flink needs to be analyzed and processed according to the specific situation. You can use methods such as pre-aggregation, redesigning the key of window aggregation, adjusting window parameters, or using Flink's Ttl operation to avoid data skew, thereby improving Flink's performance and reliability.

32. When using the KeyBy operator, the data volume of a certain Key is too large, resulting in data skew, how to deal with it?

When using the KeyBy operator, if the data volume of a key is too large, it will cause data skew and affect the calculation efficiency. In order to solve this problem, the following methods can be considered:

  1. Salt the Key : concatenate the key with a random number ("key + random suffix") so that the hot key is spread out, and do a first round of aggregation on the salted keys. At this point we obtain partial results for "original key + random number" (see the sketch after this answer).
  2. Strip the random suffix from the salted key to recover the original key, and then perform a second KeyBy and aggregation on the original key to combine the partial results. This removes the skew without affecting the final result.
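A minimal sketch of this two-stage ("salting") aggregation, assuming a Tuple2<String, Long> event stream, a salt range of 10, and one-minute processing-time windows; the first stage aggregates per salted key, the second stage strips the salt and combines the partial results per original key:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedTwoStageCount {
    public static DataStream<Tuple2<String, Long>> countWithoutSkew(DataStream<Tuple2<String, Long>> events) {
        return events
                // Stage 1: append a random salt (0..9) so one hot key is spread over 10 sub-keys.
                .map(e -> Tuple3.of(e.f0 + "#" + ThreadLocalRandom.current().nextInt(10), e.f0, e.f1))
                .returns(Types.TUPLE(Types.STRING, Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(2)                                     // partial sum per salted key and window
                // Stage 2: drop the salt and combine the partial sums on the original key.
                .map(t -> Tuple2.of(t.f1, t.f2))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1);
    }
}
```

The salt range and window length are arbitrary here; in practice they are tuned to the actual skew and latency requirements.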

33. How to solve the data hotspot after Flink uses the aggregation function

After using aggregation functions, Flink can run into data hotspot problems, mainly because some aggregation computations are expensive or concentrated on a few keys, which slows processing down and leads to data backlogs and delays. In this case, the following methods can be used to solve the data hotspot problem:

  1. Increase computing resources : Increase computing nodes and memory resources to improve the computing power of the Flink cluster, thereby speeding up data processing and reducing data backlogs and delays.
  2. Adjust the aggregation function parameters : Some aggregation functions require a large amount of calculation. You can consider adjusting the parameters of the aggregation function to reduce the amount of calculation and improve the data processing speed. For example, parameters such as the size of the window or the sliding interval can be adjusted.
  3. Use batch processing : process data in small batches at a certain time interval to reduce the pressure of per-record processing, thereby reducing data backlog and delay. For example, Flink SQL's mini-batch aggregation can be used for this (see the sketch below).
  4. Employ a data deduplication strategy : In some cases, data hotspots are caused by duplicated data. A deduplication strategy can then be adopted, for example deduplicating records by key with keyed state, so that duplicate data does not create hot spots.
  5. Adjust data source parameters : In some cases, data source parameter settings can cause data hot spots. You can adjust the parameters of the data source, such as the interval time for sending data, data compression, etc., so as to reduce the problem of data hotspots.

To sum up, solving the data hotspot problem in Flink needs to be analyzed and processed according to the specific situation. Data hot spots can be solved by increasing computing resources, adjusting aggregation function parameters, using batch processing, adopting data deduplication strategies, or adjusting data source parameters, so as to improve the performance and reliability of Flink.
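If the aggregation runs through Flink SQL / the Table API, one concrete form of the "batching" idea is mini-batch aggregation combined with two-phase (local/global) aggregation. The sketch below uses the commonly documented configuration keys; they should be checked against the Flink version actually in use:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MiniBatchConfigSketch {
    public static TableEnvironment create() {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Buffer records briefly and aggregate them per mini-batch instead of per record.
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.allow-latency", "2 s");
        tEnv.getConfig().getConfiguration().setString("table.exec.mini-batch.size", "5000");

        // Split hot-key aggregation into a local pre-aggregation plus a global aggregation.
        tEnv.getConfig().getConfiguration().setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
        return tEnv;
    }
}
```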

34. Flink tasks have high latency. How would you solve this problem?

If the latency of Flink tasks is high, optimization should be done from the following aspects:

  1. Resource tuning : First check the resource usage of the Flink cluster. If you find that the resource usage of some nodes is too high, you can consider increasing the number of nodes or adjusting the resource configuration of the nodes, such as increasing memory and CPU. In addition, the resource allocation strategy of the task manager can also be adjusted, such as using idle nodes first.
  2. Operator tuning : If the task latency is high, the parameters of the operators can be adjusted, such as window length and parallelism. A shorter window fires more often and produces more computation, which may increase latency, so it has to be tuned to the specific workload. At the same time, more efficient operators can be used, for example replacing a full-window function with an incremental ReduceFunction or AggregateFunction, and matching each operator's parallelism to its load.
  3. Data optimization : Data optimization is an important means to improve the performance of Flink tasks. Technologies such as data compression, data filtering, and data deduplication can be considered to reduce the amount of data and computation. At the same time, technologies such as batch processing and Checkpointing can also be considered to optimize the data processing process.
  4. Task scheduling optimization : Task scheduling optimization is also an important means to improve the performance of Flink tasks. You can consider the scheduling options Flink itself provides, such as the default scheduler, the adaptive scheduler, and slot-sharing configuration, which determine how tasks are distributed across TaskManagers. When Flink runs on a resource manager such as YARN, that layer's schedulers (for example the FairScheduler) and queue configuration also influence how resources are allocated to the job.
  5. Error handling : If a task has an error, it can cause increased latency. Therefore, it is necessary to set correct error handling strategies, such as using try-catch statements, setting error handling delays, etc., to avoid delays caused by errors.

To sum up, to solve the problem of high latency of Flink tasks, it is necessary to start with resource optimization, operator optimization, data optimization, task scheduling optimization, and error handling to improve the performance and reliability of Flink tasks.

35. How does Flink guarantee Exactly-once semantics?

If the underlying storage supports transactions :

Flink can achieve end-to-end consistency semantics by implementing two-phase commit and state preservation.

Divided into the following steps:

  1. Begin transaction (beginTransaction): create a temporary file/folder and write the incoming data into it
  2. Pre-commit (preCommit): flush the data buffered in memory to the file and close it
  3. Formal commit (commit): atomically move the previously written temporary file into the target directory. This also means the final data becomes visible with some delay
  4. Abort (abort): discard the temporary file
  5. If a failure occurs after the pre-commit succeeds but before the formal commit, the pre-committed data can, depending on the recovered state, either be committed or deleted

If the underlying storage does not support transactions :

End-to-end exactly-once places relatively high requirements on the sink. The concrete implementations are mainly idempotent writes and transactional writes. Whether idempotent writes are possible depends on the business logic, so transactional writes are the more common choice. There are two ways to implement transactional writes: write-ahead log (WAL) and two-phase commit (2PC).

If the external system does not support transactions, you can use the write-ahead log method to save the result data as a state, and then write it to the sink system at one time when receiving the checkpoint completion notification.
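To make the four steps above concrete, here is a plain-Java sketch of the transaction life cycle written against ordinary files; it deliberately does not use Flink's real sink API, and the paths and naming are assumptions, so it only illustrates the begin / preCommit / commit / abort contract:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TwoPhaseCommitSketch {
    private Path tempFile;
    private Writer writer;

    // 1. beginTransaction: create a temporary file and keep writing incoming records into it.
    public void beginTransaction() throws IOException {
        tempFile = Files.createTempFile("txn-", ".tmp");
        writer = Files.newBufferedWriter(tempFile);
    }

    public void write(String record) throws IOException {
        writer.write(record);
        writer.write('\n');
    }

    // 2. preCommit: flush buffered data and close the file; nothing is visible to readers yet.
    public void preCommit() throws IOException {
        writer.flush();
        writer.close();
    }

    // 3. commit: atomically move the pre-committed file into the target directory,
    //    which is why downstream readers see the data with a small delay.
    public void commit(Path targetDir) throws IOException {
        Files.move(tempFile, targetDir.resolve(tempFile.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);
    }

    // 4. abort: discard the temporary file if the checkpoint (and thus the transaction) fails.
    public void abort() throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```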

36. Talk about the status of Flink

In Flink, state refers to a mechanism for storing and processing data during real-time computing. State can be divided into two basic types: KeyedState and OperatorState.

  • KeyedState : state that is bound to a key, usually used together with a KeyedStream. KeyedState offers several state primitives, such as ValueState, ListState, MapState, ReducingState and AggregatingState. ValueState stores a single value per key, while MapState stores a map per key; in practice, ValueState and MapState are the ones most commonly used.
  • OperatorState : state that is bound to an operator instance rather than to a key, usually used with non-keyed streams. OperatorState stores the internal bookkeeping of an operator, for example the partition offsets tracked by a source connector or elements buffered in a sink.

Both KeyedState and OperatorState are state types in Flink, and they play an important role in real-time computing. KeyedState is usually used to process key-based data, such as counting and aggregating a key; OperatorState is usually used to process non-key-based data, such as windowing and sorting data.

When learning the state in Flink, you need to understand the basic concepts, classification, usage and related concepts of state management. At the same time, it is necessary to master how to use KeyedState and OperatorState in the program in order to process data in real-time calculations.
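A minimal KeyedState sketch, assuming a Tuple2<String, Long> event stream: a RichFlatMapFunction that keeps a per-key running count in a ValueState.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PerKeyCounter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> countState;   // one Long per key, managed by Flink

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("per-key-count", Types.LONG);
        countState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = countState.value();            // null on the first event for this key
        long updated = (current == null ? 0L : current) + 1;
        countState.update(updated);
        out.collect(Tuple2.of(event.f0, updated));
    }
}

// Usage (must be on a keyed stream): events.keyBy(e -> e.f0).flatMap(new PerKeyCounter());
```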

37. Talk about Flink's state storage mechanism

Flink's state storage refers to the data structure and storage system used to store and manage the operator state during the running of the Flink program. Flink provides a variety of state backends to suit different application scenarios and needs. Here we will describe Flink's state storage in detail, including state backends before version 1.13 and state backends after version 1.13.

Prior to version 1.13:

  • MemoryStateBackend : used during development. This is an in-memory stateful backend for fast debugging and testing of Flink programs during development. Since it uses memory to store state data, it is suitable for scenarios with small state data.
  • FsStateBackend : Used in production, commonly used. This is a filesystem-based state backend that stores state data on disk. FsStateBackend provides a high-availability state backup and recovery mechanism to ensure that the state can be restored in the event of a task failure.
  • RocksDBStateBackend : used in production, for very large state. This is a RocksDB-based state backend that uses the RocksDB database to store state data. RocksDB is a key-value storage system that supports efficient compression and fast lookup, and is suitable for scenarios of processing large-scale state data.

After version 1.13:

  • HashMapStateBackend : the successor of MemoryStateBackend and FsStateBackend. Starting from version 1.13, Flink unified the state backends: HashMapStateBackend keeps the working state in HashMap structures on the TaskManager heap, and whether snapshots go to the JobManager memory or to a file system is decided by the separately configured checkpoint storage.
  • EmbeddedRocksDBStateBackend : Used in production, for very large state. This is a RocksDB-based state backend, but unlike RocksDBStateBackend, it embeds the RocksDB database into Flink's TaskManager process. The advantage of this is that when the state data is large, it can reduce network overhead and improve access performance.

In short, Flink's state storage system includes a variety of state backends to suit different application scenarios and needs. Developers can choose the appropriate state backend according to the actual situation to achieve efficient and reliable Flink programs.
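A hedged sketch of selecting a state backend in code on Flink 1.13+ (the checkpoint path is a placeholder; on earlier versions FsStateBackend / RocksDBStateBackend would be used instead):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        boolean veryLargeState = false;
        if (veryLargeState) {
            env.setStateBackend(new EmbeddedRocksDBStateBackend());  // state in RocksDB inside the TaskManager process
        } else {
            env.setStateBackend(new HashMapStateBackend());          // state on the TaskManager heap
        }

        // Where snapshots of that state are written (placeholder path).
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
    }
}
```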

38. Introduce Flink's CEP mechanism

Flink's CEP (Complex Event Processing, complex event processing) mechanism is mainly used to process complex events in real-time data streams, so as to calculate and respond to these events in real time. Different from the traditional batch processing method, the CEP mechanism can process events in the real-time data stream, and perform real-time calculation and response according to the complex logic of the event.

Flink CEP is the complex event processing (CEP) library implemented on top of Flink. CEP allows event patterns to be detected in an endless stream of events, giving us the chance to capture the important parts of the data. One or more streams of simple events are matched against rules, and the output is what the user wants, namely complex events that satisfy those rules.

Flink's CEP mechanism mainly relies on two core components: Flink's stream processing framework and CEP library. Flink's stream processing framework provides low-latency, high-throughput data stream processing capabilities and can handle massive amounts of real-time data. The CEP library provides the logic for processing complex events, and can implement functions such as event filtering, aggregation, and routing. Through the combination of these two components, Flink can realize real-time processing and response to complex events in real-time data streams.

Flink's CEP mechanism has the following characteristics:

  1. Real-time : Flink's CEP mechanism can process events in real-time data streams, and calculate and respond to these events in real time with very low latency.
  2. Flexibility : The CEP library provides flexible event processing logic, which can define event processing methods according to specific business requirements, such as filtering, aggregation, routing, etc.
  3. Scalability : Flink's stream processing framework has excellent horizontal scalability, and can dynamically increase or decrease computing resources according to the scale of data flow and processing requirements.
  4. High availability : Flink's CEP mechanism supports failure recovery, which can automatically recover when the application fails to avoid data loss and impact.
  5. Stream processing : Flink's CEP mechanism uses stream processing to process real-time data streams, and can calculate and respond to events in real time without collecting all data first and then performing batch processing.

Flink's CEP mechanism can be widely used in finance, Internet of Things, logistics and other industries in practical applications, such as: real-time calculation of stock transaction data, real-time monitoring of sensor data, real-time routing of logistics information, etc. Understanding Flink's CEP mechanism helps to better deal with complex event processing requirements in real-time data streams.
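A minimal Flink CEP sketch, assuming a simple LoginEvent type with userId and status fields: it detects two consecutive failed logins for the same user within ten seconds.

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LoginFailPattern {
    // Assumed event type with userId and status fields.
    public static class LoginEvent {
        public String userId;
        public String status;   // e.g. "fail" or "success"
    }

    public static DataStream<String> detect(DataStream<LoginEvent> logins) {
        Pattern<LoginEvent, ?> twoFails = Pattern.<LoginEvent>begin("first")
                .where(new SimpleCondition<LoginEvent>() {
                    @Override public boolean filter(LoginEvent e) { return "fail".equals(e.status); }
                })
                .next("second")
                .where(new SimpleCondition<LoginEvent>() {
                    @Override public boolean filter(LoginEvent e) { return "fail".equals(e.status); }
                })
                .within(Time.seconds(10));   // bounds how long partial matches are kept

        PatternStream<LoginEvent> matches = CEP.pattern(logins.keyBy(e -> e.userId), twoFails);

        return matches.select(new PatternSelectFunction<LoginEvent, String>() {
            @Override
            public String select(Map<String, List<LoginEvent>> pattern) {
                return "two consecutive failed logins for user " + pattern.get("first").get(0).userId;
            }
        });
    }
}
```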

39. In Flink CEP programming, where is the data kept while the matching state has not yet been reached?

In Flink CEP programming, while the matching state has not yet been reached, the data is usually kept in memory. In stream processing, CEP has to support event time, which also means supporting late-arriving data via the watermark mechanism. Partially matched event sequences are handled in a similar way to late data: in Flink CEP's processing logic, not-yet-matched and late data are stored in a Map data structure.

Keeping this data in memory is necessary for handling delayed data, but it does put pressure on memory. To reduce the memory footprint, the following strategies can be adopted:

  1. Reasonably set the time interval of the state : according to the business requirements and the actual situation of data processing, reasonably set the time interval of the state to reduce the amount of data stored in the memory.
  2. Use an external state store : Store state data in an external state store, such as Redis, HBase, etc., to reduce memory pressure.
  3. Optimize the CEP algorithm : optimize the CEP algorithm so that it can use memory more effectively and reduce memory usage when processing delayed data.
  4. Reasonably set the parallelism of Flink : According to the actual hardware resources and data processing requirements, reasonably set the parallelism of Flink to balance the relationship between memory usage and processing speed.

In short, in Flink CEP programming the data is kept in memory while the matching state has not yet been reached. To reduce memory usage, strategies such as setting the state time interval reasonably, using an external state store, optimizing the CEP logic, and setting Flink's parallelism reasonably can be adopted.
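For point 1 ("set the time interval of the state reasonably"), Flink's state TTL is one concrete handle. The sketch below applies a one-hour TTL to a user-defined ValueState; it is illustrative only, and note that CEP's own partial-match buffers are bounded with Pattern.within() rather than with this mechanism:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;

public class TtlStateSketch {
    public static ValueStateDescriptor<Long> descriptorWithTtl() {
        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)            // refresh TTL on writes
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("pending-count", Types.LONG);
        descriptor.enableTimeToLive(ttlConfig);   // expired entries are dropped and never returned
        return descriptor;
    }
}
```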

40. What is the parallelism of Flink? What is the parallelism setting of Flink?

The parallelism of Flink refers to the number of parallel subtasks an operator is split into, i.e. how many data stream partitions can be processed at the same time. By setting the degree of parallelism, the TaskSlots (task slots) in the cluster can be fully used to execute multiple stream partitions in parallel, thereby improving computing performance. The concept itself is easy to understand: a Kafka source, for example, is usually given a parallelism equal to the number of partitions of its topic.

The parallelism setting of Flink can be adjusted through the internal parameters of the operator or external configuration.

Here are some common setup methods:

  1. Operator level : every DataStream operator (Map, Filter, Reduce, sources, sinks, etc.) exposes setParallelism(), which overrides the job-wide default for that operator alone.
  2. Execution environment level : StreamExecutionEnvironment.setParallelism() sets the default parallelism for all operators of the job that do not set their own.
  3. Client and cluster level : the -p option of flink run sets the job default at submission time, and parallelism.default in flink-conf.yaml sets the cluster-wide fallback; the total number of task slots (taskmanager.numberOfTaskSlots × number of TaskManagers) caps what can actually run. A short sketch follows below.

In general, the degree of parallelism should be set according to the amount of data. For source operators (such as a Kafka source or an HDFS source), the parallelism is usually kept consistent with the number of partitions, because source operators usually do not generate too much data themselves. For intermediate operators (such as Map and Filter), the parallelism can be adjusted according to the data volume. For aggregation operators (such as Reduce and Aggregate) and join operators, the parallelism usually has to weigh both the data volume and the load on the operator.

Reasonably setting the degree of parallelism can give full play to Flink's parallel computing advantages and improve the performance of data processing.
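A small sketch of the usual configuration levels (the numbers are arbitrary): the operator-level setParallelism() overrides the environment default, which in turn overrides the cluster-wide parallelism.default from flink-conf.yaml or the -p option of flink run.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-wide default; can also come from flink-conf.yaml (parallelism.default) or `flink run -p`.
        env.setParallelism(4);

        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.map(String::toUpperCase)
             .setParallelism(8)            // operator-level override for a heavy map
             .print()
             .setParallelism(1);           // single sink instance

        env.execute("parallelism-demo");
    }
}
```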

41. Talk about Flink's partition strategy

Flink provides a variety of partitioning strategies to meet the needs of different data processing. The following is a detailed description:

  1. GlobalPartitioner : sends all data to the first instance of the downstream operator. This partitioner is suitable when only one instance is required to process the data.
  2. ShufflePartitioner : randomly distributes data to the downstream operator instances. This partitioner is suitable when the data should be spread at random, for example for data obfuscation or simple load spreading.
  3. RebalancePartitioner : sends data to the downstream instances in a round-robin fashion. This partitioner is suitable when the load needs to be evened out across all downstream instances, for example in data cleaning or data conversion pipelines.
  4. RescalePartitioner : distributes data round-robin, but only within a local group of downstream instances determined by the ratio of upstream to downstream parallelism. This partitioner is suitable when data should be redistributed according to operator parallelism while staying local, for example before aggregation or filtering steps.
  5. BroadcastPartitioner : outputs every record to each instance of the downstream operator. This partitioner is suitable when the data needs to be broadcast to all instances, for example small dimension or configuration data.
  6. ForwardPartitioner : forwards each record to the corresponding downstream instance one-to-one; it requires the upstream and downstream operators to have the same parallelism, and is suitable when records should not be redistributed at all.
  7. KeyGroupStreamPartitioner : outputs to the downstream operator according to the hash of the key. This partitioner is suitable when data must be partitioned by key, for example for grouping or per-key aggregation.
  8. KeyedStream partitioning : internally, keys are first mapped to key-group indexes and the key groups are then assigned to downstream operator instances according to the hash of the key. This mechanism is not exposed to users directly; Flink uses it internally, in particular for redistributing keyed state.
  9. CustomPartitionerWrapper : a user-defined partitioner. The user implements the Partitioner interface to define the routing logic, which is suitable when data has to be partitioned according to application-specific rules.

Flink provides a variety of built-in partitioners to meet common data processing needs, and also supports user-defined partitioners for specific requirements; a short sketch of selecting these partitioners on a DataStream follows.
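A quick sketch of how several of these partitioners are selected on a DataStream; the custom partitioner logic is an arbitrary example, and each call only becomes effective once an operator or sink follows it:

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.datastream.DataStream;

public class PartitioningSketch {
    public static void apply(DataStream<String> stream) {
        stream.shuffle();      // ShufflePartitioner: random distribution
        stream.rebalance();    // RebalancePartitioner: round-robin to all downstream instances
        stream.rescale();      // RescalePartitioner: round-robin within a local group of instances
        stream.broadcast();    // BroadcastPartitioner: copy every record to every downstream instance
        stream.global();       // GlobalPartitioner: everything to the first downstream instance
        stream.forward();      // ForwardPartitioner: one-to-one, requires equal parallelism
        stream.keyBy(s -> s);  // KeyGroupStreamPartitioner: hash of the key decides the target

        // CustomPartitionerWrapper: user-defined routing logic.
        stream.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        return Math.abs(key.hashCode()) % numPartitions;
                    }
                },
                value -> value);
    }
}
```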

42. What is a Task Slot?

TaskSlot is the concept Flink uses to control how many tasks a TaskManager accepts. It is an abstraction that represents the number of parallel subtasks one TaskManager can host. In Flink, the TaskManager is the worker node (a JVM process) that actually executes the program, and its slots provide resource isolation and parallel execution inside that process. Through the TaskSlot concept, the number of tasks accepted by a TaskManager can be controlled, so that cluster resources are used more effectively.

When a source is given a parallelism of three, it needs three TaskSlots. This is because the number of TaskSlots determines how many parallel subtasks can run: a parallelism of three means three subtasks, and each of them occupies one slot.

Another important optimization concept is that when the parallelism of operators is the same, and there is no parallelism change, or no shuffle, these operators will be merged together. The purpose of this is to reduce resource consumption and improve computing efficiency.
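As a hedged illustration: in a standalone cluster the slot count per TaskManager is normally set with taskmanager.numberOfTaskSlots in flink-conf.yaml; the local-environment sketch below mimics the same setting programmatically with three slots, so three parallel subtasks fit into one (local) TaskManager.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to "taskmanager.numberOfTaskSlots: 3" in flink-conf.yaml.
        conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 3);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(3, conf);

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)   // runs with parallelism 3, one subtask per slot
           .print();

        env.execute("slot-demo");
    }
}
```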

43. What is the relationship between Flink's Slots and parallelism?

A Slot is a concept of the TaskManager: it denotes one slot of a TaskManager. Parallelism is a concept of the program, indicating the degree to which the program is executed in parallel. In Flink, slots and parallelism are closely related.

Specifically, the slot is the resource allocation unit of the TaskManager and determines the parallelism that the TaskManager can support. A TaskManager has multiple slots, and each slot can be assigned to one task for execution. Therefore, the maximum parallelism a TaskManager can support equals the number of its slots.

The parallelism specified by the program consumes slots; in other words, the program obtains its parallelism by being assigned slots. When a program needs a higher degree of parallelism, it requires more slots from the TaskManagers so that more subtasks can run at the same time.

Therefore, the relationship between slots and parallelism can be summarized as follows: the slot is a TaskManager-side concept that determines how much parallelism the TaskManager can supply; parallelism is a program-side concept that is realized by occupying slots. The TaskManager is the provider, offering slot resources for the program to use; the program is the consumer, obtaining its degree of parallelism by being allocated slots.

44. Talk about Flink's resource scheduling

Flink's resource scheduling is based on the concepts of TaskManager and task slot. The TaskManager is the worker process responsible for running and managing tasks, while the task slot is the finest-grained resource unit in the TaskManager, representing a fixed-size subset of its resources. Each TaskManager divides the resources it occupies evenly among its slots. By adjusting the number of task slots, users can define how tasks are isolated from each other.

When each TaskManager has only one slot, each task runs in an independent JVM. The advantage of this is that the isolation between tasks is stronger: a problem in one task will not affect other tasks, and a dedicated JVM also gives cleaner resource management and garbage collection.

And when TaskManager has multiple slots, multiple tasks can run in the same JVM. The advantage of this is that the TCP connection (based on multiplexing) and the heartbeat message can be shared, thereby reducing the network transmission of data. In addition, tasks in the same JVM process can also share some data structures, thereby reducing the consumption of each task.

In Flink, each slot can accept a single task, or a pipeline composed of multiple consecutive tasks. For example, the FlatMap function occupies a taskslot, while the key Agg function and the sink function share a taskslot. This flexible resource scheduling method can be optimized and configured according to different task requirements to improve system resource utilization and performance.

In short, Flink's resource scheduling is realized through the concepts of TaskManager and Task slot. By adjusting the number and allocation of task slots, it can meet the needs of different tasks and improve the resource utilization and performance of the system.

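A hedged sketch of influencing slot sharing from the API: operators stay in the default slot-sharing group unless told otherwise, so putting a heavy aggregation into its own group forces it (and, unless overridden, its downstream operators) into separate slots; the group name and field types here are assumptions.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class SlotSharingSketch {
    public static void build(DataStream<Tuple2<String, Long>> events) {
        events
            .filter(e -> e.f1 > 0)                // stays in the "default" slot-sharing group
            .keyBy(e -> e.f0)
            .sum(1)                               // keyed aggregation
            .slotSharingGroup("heavy-agg")        // pin the aggregation to its own group of slots
            .print();                             // the sink inherits the group of its input
    }
}
```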

45. Is there a restart strategy in Flink?

The restart strategy in Flink is used to restart the operator to restore the running of the program when a failure occurs during the running of the program. The restart policy can be configured in flink-conf.yaml or dynamically specified in the application code.

The following are the four common restart strategies in Flink:

  1. Fixed Delay Restart Strategy : when a failure occurs, Flink waits for a fixed delay and then restarts the job, up to a configured maximum number of attempts. Once the attempts are used up, the job is marked as failed. This strategy suits transient faults that usually disappear after a short wait.
  2. Failure Rate Restart Strategy : this strategy decides whether to keep restarting based on the failure rate. The job is restarted after each failure as long as the number of failures per time interval stays below a preset threshold; once the threshold is exceeded, the job fails. It suits jobs that may fail occasionally due to data anomalies or program bugs but should not be allowed to fail continuously.
  3. No Restart Strategy : when a failure occurs, the job is not restarted and fails directly. This suits jobs where a failure is acceptable or where restarts are handled by an external system.
  4. Fallback Restart Strategy : the job itself does not define a restart strategy and falls back to the cluster-level default configured in flink-conf.yaml.

If checkpointing is not enabled, the no-restart strategy is used. If checkpointing is enabled but no restart strategy is configured, a fixed-delay strategy with a large number of attempts is used. For the fixed-delay strategy, the wait between restart attempts and the maximum number of attempts are set via restart-strategy.fixed-delay.delay and restart-strategy.fixed-delay.attempts in flink-conf.yaml.
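A sketch of configuring restart strategies programmatically (attempt counts and delays are arbitrary); the same settings can be made in flink-conf.yaml via the restart-strategy.* keys:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Fixed delay: at most 3 attempts, 10 seconds apart.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Failure rate: give up if more than 3 failures happen within 5 minutes; wait 10 s between restarts.
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.minutes(5), Time.of(10, TimeUnit.SECONDS)));

        // Or disable restarts entirely.
        // env.setRestartStrategy(RestartStrategies.noRestart());
    }
}
```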

46. What should I do if Flink encounters an abnormal restart of the program?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

47. What is the function of Flink's distributed cache? how to use?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

48. Talk about broadcast variables in Flink. What should you pay attention to when using broadcast variables?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

49. Talk about Flink Operator Chains

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

50. Under what circumstances will Flink combine Operator chains to form an operator chain?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

51. How to implement Flink serialization

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

52. Does Flink need to depend on Hadoop?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

53. What are the Flink component stacks?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

54. Which machine learning and graph processing libraries does Flink support?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

55. Talk about Flink runtime components

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

56. How many layers can Flink's API be divided into?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

57. How many kinds of UDFs are there in Flink used in the table API?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

58. Talk about the implementation principle of Flink SQL

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

59. Talk about the Flink task submission process

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

60. What are the common submission modes of Flink-On-Yarn, and what are their advantages and disadvantages?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

61. Talk about the execution graph of Flink

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

62. Talk about Flink's CBO, logical execution plan and physical execution plan

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

63. What is Flink's global snapshot? Why do you need global snapshots?

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

64. How to do Flink dimension table association

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

65. How does Flink deduplicate massive keys

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

66. Talk about Flink's RPC

Omitted here due to character limit

For the complete content, please refer to "Nin's Big Data Interview Collection", pdf and get it from Nin

say later

This article "Nin's Big Data Interview Collection" is a companion article to "Nin's Java Interview Collection".

Here is a special explanation: Since the first release of the 41 topic PDFs of "Nin's Java Interview Collection", thousands of questions have been collected. I joined a big factory and got a high salary. Therefore, the collection of interview questions in "Nin Java Interview Collection" has become a must-read book for Java learning and interviews.

Therefore, the Nien architecture team struck while the iron was hot and launched the "Neon Big Data Interview Collection", which has already released four topics:

" Nin's Big Data Interview Collection Topic 1: The Most Complete Hadoop Interview Questions in History "

" Nin's Big Data Interview Collection Topic 2: Top Secret 100 Spark Interview Questions, Memorized 100 Times, Get a High Salary "

" Nin's Big Data Interview Collection Topic 3: The Most Complete Hive Interview Questions in History, Continuously Iterating and Continuously Upgrading "

"Nin's Big Data Interview Collection Topic 4: The Most Complete Flink Interview Questions in History, Constantly Iterating and Continuously Upgrading" (this article)

The complete pdf can be obtained at the official account [Technical Freedom Circle] at the end of the article.

Moreover, "Neon's Big Data Interview Collection" and "Neon's Java Interview Collection" will continue to iterate and update to absorb the latest interview questions

PDF release notice: "Big Data Flink Study Bible"

Future career, how to break through: grow into a three-in-one architect covering Java architecture + Go architecture + big data architecture

Nien will soon write the "Big Data Flink Study Bible" and the "Big Data HBASE Study Bible" for you.


Origin blog.csdn.net/crazymakercircle/article/details/132199368