Big Data Flink Study Bible: one book to help you achieve big data Flink freedom

Learning objective: grow into a three-in-one architect

This article is the V1 version of "Big Data Flink Study Bible", a companion to "Nien's Big Data Interview Collection".

A special note: since the first release of the five topic PDFs of "Nien's Big Data Interview Collection", hundreds of questions have been collected, including a large amount of first-hand interview material and real questions from major companies. The interview question collection in "Nien's Big Data Interview Collection" is set to become a must-read for big data study and interviews.

Therefore, the Nien architecture team struck while the iron was hot and launched the "Big Data Flink Study Bible" and the "Big Data HBase Study Bible".

"Big Data Flink Study Bible" will continue to be upgraded and iterated in the future, and it will become a must-read book for studying and interviewing in the field of big data .

Ultimately, the goal is to help everyone grow into a three-in-one architect, join a top tech company, and earn a high salary.

For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy", and "Nien Java Interview Collection", please visit the official account [Technical Freedom Circle] to get them.

"Java+Big Data" Amphibious Architecture Success Case

Successful case 1:

A stunning comeback: after 4 months of unemployment, an engineer with 3 years of experience landed an architecture offer within 1 month, despite being older and switching fields.

Successful case 2:

Quick offers: an Alibaba P6 engineer who had been laid off landed quickly, receiving 2 high-quality offers (including one from Didi) within 1 month.


1. What is Flink

Big Data

Big data (Big Data) refers to large-scale, structurally diverse, and fast-growing data collections. These collections often contain data that traditional database management systems cannot handle effectively and are highly complex and challenging. The main characteristics of big data are summarized in three dimensions, the **three Vs**: Volume (large amount of data), Variety (data diversity), and Velocity (speed of data generation and processing).

  1. Large volume of data (Volume) : One of the most obvious characteristics of big data is its huge data volume. Traditional data processing methods and tools may become inefficient or infeasible when dealing with data of this scale.
  2. Data diversity (Variety) : Big data includes not only structured data (such as tabular data), but also semi-structured data (such as JSON, XML) and unstructured data (such as text, images, audio, video, etc.). These data may come from different sources and in different formats.
  3. Velocity : Big data is often generated, transmitted, and accumulated at a high rate. This requires the data processing system to be able to process data in real time or near real time in order to extract valuable information from it.

Distributed Computing

With the development of computer technology and the increase of data scale, the processing power and storage capacity of a single computer gradually become limited, which cannot meet the requirements of big data processing. In order to meet this challenge, distributed computing came into being, which uses multiple computers to form a cluster, divides computing tasks into multiple subtasks and executes them in parallel on different computing nodes, thereby improving computing efficiency and processing power.

The core idea of distributed computing is to divide a large problem into small problems, distribute the tasks to multiple computing nodes for parallel execution, and finally combine the results to obtain the final solution. This approach effectively solves the problem that a single computer cannot handle large-scale data and highly concurrent computation. At the same time, distributed computing has good scalability: the size of the cluster can be expanded flexibly as the data volume grows, meeting the challenges of ever-increasing data.

The concept of distributed computing sounds profound, but the idea behind it is simple: divide and conquer (Divide and Conquer). Divide and conquer is an algorithm design strategy that decomposes a problem into multiple identical or similar sub-problems, solves these sub-problems separately, and finally combines the sub-solutions to obtain the solution of the original problem. It is often used to solve complex problems; in big data processing in particular, a large-scale data collection can be divided into smaller parts, each part processed separately, and the results finally combined.

When dealing with big data problems, the divide-and-conquer idea can be used to improve efficiency and scalability. Here are some examples of applying divide and conquer to big data problems:

  1. The MapReduce pattern : The classic application of divide and conquer is the MapReduce pattern, which divides a large-scale data collection into multiple small blocks, has each block processed by a different computing node, and then combines the results. This approach is suitable for batch processing tasks such as data cleaning, transformation, and aggregation.

  2. Parallel computing : Decompose large-scale computing tasks into multiple small tasks, assign them to different computing nodes for parallel processing, and finally combine the results. This is suitable for computationally intensive problems such as numerical simulations, graph algorithms, etc.

  3. Distributed sorting : Divide a large-scale data set into multiple parts, each part is sorted on a different computing node, and then use the merge sort algorithm to merge these ordered parts into an overall ordered data set.

  4. Partition and sharding : In a distributed storage system, data partitions and shards can be stored on different nodes, and data can be distributed to different storage nodes through partition keys or hash functions, thereby realizing distributed storage of data and management.

  5. Distributed machine learning : Decompose large-scale machine learning tasks into multiple subtasks, train them separately in a distributed computing environment, and then combine model parameters, such as distributed stochastic gradient descent algorithm.

  6. Data splitting and merging : For large data sets that require frequent access, the data can be split into multiple small blocks, each of which is stored on a different storage node, and then merged as needed to reduce the overhead of data access.

The application of divide and conquer in big data processing not only helps improve processing efficiency, but also makes full use of distributed computing and storage resources, so as to better cope with the volume and complexity of big data. However, when applying the divide-and-conquer method, issues such as an appropriate data-splitting strategy, task scheduling, and result merging need to be considered to ensure its correctness and performance.
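To make the divide-and-conquer idea concrete, here is a minimal, self-contained Java sketch (plain JDK, not Flink) that sums a large array with the ForkJoin framework: the task is split in half until each piece is small enough, the pieces are computed in parallel, and the partial results are merged. Distributed frameworks such as MapReduce and Flink apply the same split/compute/combine pattern, only across machines instead of threads.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class DivideAndConquerSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this size, compute directly
    private final long[] data;
    private final int start, end;

    public DivideAndConquerSum(long[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
            return sum;                      // "conquer" the small sub-problem directly
        }
        int mid = (start + end) / 2;
        DivideAndConquerSum left = new DivideAndConquerSum(data, start, mid);
        DivideAndConquerSum right = new DivideAndConquerSum(data, mid, end);
        left.fork();                         // "divide": process the left half in parallel
        long rightSum = right.compute();
        long leftSum = left.join();
        return leftSum + rightSum;           // "combine" the partial results
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long sum = new ForkJoinPool().invoke(new DivideAndConquerSum(data, 0, data.length));
        System.out.println("sum = " + sum);
    }
}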

However, distributed computing also brings some challenges, such as data consistency, communication overhead, task scheduling, etc. It is necessary to consider various factors to design and optimize distributed systems. At the same time, distributed computing also requires developers to have the knowledge and skills of distributed system design and tuning to ensure the performance and stability of the system.

Distributed Storage

When the amount of data is huge and stand-alone storage can no longer meet the demand, distributed storage and distributed file system become the key technologies for processing big data. Below I will introduce the concepts, characteristics and common implementations of distributed storage and distributed file systems in detail.

Distributed storage:

Distributed storage is a data storage solution that disperses and stores data on multiple nodes to provide high capacity, high performance, high reliability and scalability. Each node can access data over the network, and multiple nodes work together to handle data requests. The core goal of distributed storage is to solve the bottleneck of stand-alone storage while providing high reliability and availability.

The characteristics of distributed storage include:

  • Horizontal scalability : The storage capacity and performance can be expanded by adding nodes to adapt to the ever-increasing data volume and load.
  • High reliability and fault tolerance : Data is redundantly stored on multiple nodes. When a node fails, the data is still available and will not be lost.
  • Data distribution and replication : Data is distributed on different nodes according to a certain strategy, and data replication ensures data availability and fault tolerance.
  • Concurrent access and high performance : Support multiple clients to access data at the same time to achieve high concurrency and better performance.
  • Flexible data model : supports multiple data types and access methods, such as file system, object storage, key-value storage, etc.

Distributed file system:

Distributed file system is a special type of distributed storage, mainly used to store and manage file data. It provides an interface similar to the traditional stand-alone file system, but in the underlying implementation, data is distributed and stored on multiple nodes. Distributed file systems can automatically handle data distribution, replication, consistency, and failure recovery.

Common distributed file system characteristics include:

  • Namespace and path : The distributed file system accesses files through paths, similar to the directory structure of traditional file systems.
  • Data distribution and replication : Files are split into chunks and distributed across multiple nodes, while data is replicated for redundancy and high availability.
  • Consistency and data consistency model : Distributed file systems need to ensure data consistency, and data copies on different nodes need to be kept in sync.
  • Access control and rights management : Provide user and application access control and rights management functions to ensure data security.
  • High performance : Distributed file systems usually optimize data read and write performance to meet the needs of big data scenarios.
  • Scalability : Storage capacity and performance can be expanded by adding nodes.

Common distributed file systems include:

  • Hadoop HDFS (Hadoop Distributed File System) : A distributed file system in the Hadoop ecosystem, suitable for big data storage.
  • Ceph : An open source distributed storage system that provides block storage, file system and object storage.
  • GlusterFS : An open source distributed file system that can linearly scale storage capacity and performance.

In short, distributed storage and distributed file systems play an important role in the era of big data, helping us store, manage and access massive amounts of data, and solve challenges that traditional stand-alone storage cannot handle.

Batch and Stream Processing

Batch processing and stream processing are two common data processing modes in the field of big data processing, which are used for different types of data processing needs. The following will introduce these two modes in detail, and give examples of relevant application scenarios.

Batch Processing:

Batch processing refers to bringing together a batch of data and processing and analyzing this batch of data within a fixed time interval. Batch processing is usually suitable for scenarios with large data volumes, long processing cycles, and high consistency requirements.

Features:

  • Data is processed centrally and is suitable for periodic analysis and report generation.
  • The data is divided into small pieces, and each piece is processed in a job.
  • Data processing takes a long time and is not suitable for scenarios with high real-time requirements.

Application scenario example:

  1. Offline data analysis : Analyze historical data to discover trends, patterns, and laws for business decision-making. For example, sales data analysis, user behavior analysis.
  2. Batch recommendation system : based on user historical behavior data, regularly generate recommendation results. For example, movie recommendation, product recommendation.
  3. Data cleaning and preprocessing : Clean, filter and preprocess large-scale data to improve data quality and availability. For example, cleaning invalid data, filling missing values.
  4. Large-scale ETL (Extract, Transform, Load) : Extract data from the source system, transform and process it, and load it into the target system. For example, the construction of a data warehouse.

Stream Processing:

Stream processing refers to processing data immediately when it is generated to realize real-time processing and analysis of data. Stream processing is usually suitable for scenarios that require high real-time data and fast response.

Features:

  • Data flows in real time and requires fast processing and response.
  • Data arrives continuously and requires real-time calculation and analysis.
  • You may experience issues such as latency and out-of-order data.

Application scenario example:

  1. Real-time monitoring and alarming : monitor and analyze real-time data, detect abnormalities in time and trigger alarms. For example, network traffic monitoring, system performance monitoring.
  2. Real-time data analysis : Real-time analysis of streaming data to extract valuable information from it. For example, real-time clickstream analysis, real-time market analysis.
  3. Real-time recommendation system : Generate recommendation results in real time based on user real-time behavior data. For example, news recommendation, advertisement recommendation.
  4. Real-time data warehouse : build a real-time data warehouse to integrate, process and analyze real-time data. For example, real-time sales data analysis, real-time user behavior analysis.

In short, batch processing and stream processing are suitable for different types of data processing requirements, and the appropriate processing mode should be selected according to business needs and real-time requirements.

Open source big data technology

Hadoop, YARN, Spark, and Flink are all important technologies when talking about big data processing. They all belong to the distributed computing framework in the field of big data, but they are different in function and usage.

Hadoop:

Hadoop is an open source distributed storage and computing framework originally developed by Apache to process large-scale data sets. The core components of Hadoop include:

  1. Hadoop Distributed File System (HDFS) : HDFS is a distributed file system used to store large-scale data. It divides data into chunks and distributes these chunks across different nodes in the cluster. HDFS supports high reliability, redundant storage and data replication.

  2. MapReduce : MapReduce is Hadoop's computational model for processing distributed data. It divides computing tasks into two phases, Map and Reduce, and distributes them on nodes in the cluster for parallel execution. The Map stage is responsible for data splitting and processing, and the Reduce stage is responsible for data aggregation and calculation.

YARN (Yet Another Resource Negotiator):

YARN is the resource manager of Hadoop, which is responsible for the management and allocation of cluster resources. YARN divides cluster resources into containers (Containers) and allocates them to different applications. This isolation and management of resources allows multiple applications to run concurrently on the same Hadoop cluster, improving resource utilization and multi-tenancy capabilities of the cluster.

Spark:

Apache Spark is a general-purpose distributed computing engine designed to provide high performance, ease of use, and versatility. Compared to traditional Hadoop MapReduce, Spark has faster execution speed because it loads data into memory and performs in-memory calculations. Spark supports multiple computing modes, including batch processing, interactive query, stream processing, and machine learning.

Key features and components of Spark include:

  1. RDD (Resilient Distributed Dataset) : RDD is the core data abstraction of Spark, which represents a distributed dataset. RDD supports parallel operations and fault tolerance, and can recalculate lost partitions during computation.

  2. Spark SQL : Spark SQL is a component for processing structured data that supports SQL queries and operations. It can seamlessly integrate RDD with traditional data sources such as Hive.

  3. Spark Streaming : Spark Streaming is a module for processing real-time streaming data and supports micro-batch processing mode. It is capable of splitting real-time data streams into small batches and processing them.

  4. MLlib : MLlib is Spark's machine learning library that provides common machine learning algorithms and tools for training and evaluating models.

  5. GraphX : GraphX is Spark's graph computing library for processing graph data and graph algorithms.

Flink:

Apache Flink is a stream processing engine and distributed batch processing framework with low latency, high throughput and fault tolerance. Flink supports stream-batch integration, enabling seamless switching between real-time stream processing and batch jobs. Its core features include:

  1. DataStream API : Flink's DataStream API is used to process real-time streaming data, supporting event time processing, window operation and state management. It is capable of handling high-throughput real-time data streams.
  2. DataSet API : Flink's DataSet API is used for batch processing jobs, similar to Hadoop's MapReduce. It supports rich operators and optimization techniques.
  3. Stateful Stream Processing : Flink supports stateful stream processing, which can save and manage state during processing. This is useful for implementing complex data processing logic.
  4. Event Time Processing : Flink supports event time processing, which can handle out-of-order events and accurately calculate the results of window operations.
  5. Table API and SQL : Flink provides Table API and SQL queries, enabling developers to use SQL-like syntax to query and analyze data.
  6. Rich connectors : Flink can connect to various components of the big data ecosystem, including Kafka, Elasticsearch, JDBC, HDFS, and Amazon S3.
  7. Flexible deployment : Flink can run on Kubernetes, YARN, Mesos, and Standalone clusters.

Several main advantages of Flink in stream processing are as follows:

  1. Real streaming computing engine : Flink has a better streaming computing model, which can perform very efficient state operations and window operations. Spark Streaming is still a micro-batch engine.

  2. Lower latency : Flink can achieve millisecond-level low-latency processing, while Spark Streaming has a higher latency.

  3. Better fault tolerance mechanism : Flink supports finer-grained state management and checkpoint mechanism, which can achieve exactly once state consistency semantics. Spark is more difficult to ensure exactly once.

  4. Support for finite and infinite data streams : Flink can handle finite data streams with a start and end, as well as infinite and growing data streams. Spark Streaming is better suited for limited datasets.

  5. Easier to unify batch processing and stream processing : Flink provides DataStream and DataSet APIs, which can easily unify batch processing and stream processing. Spark needs to be used in conjunction with Spark SQL.

  6. Better memory management : Flink has its own memory management, which can optimize memory usage according to different queries. Spark relies on Hadoop YARN for resource scheduling.

  7. Higher performance : In some scenarios, Flink offers higher throughput and lower latency than Spark Streaming.

Overall, as a new-generation stream processing engine, Flink outperforms Spark Streaming in latency, fault tolerance, and ease of use. However, the Spark ecosystem is more mature, and Spark is working to close the gap with Flink. The most suitable framework should be chosen based on the specific scenario.

In general, Flink's advantages in stream processing lie mainly in event-time processing, low latency, exactly-once semantics, and state management. These features allow Flink to better meet complex business requirements when processing real-time streaming data, especially for applications that demand high accuracy and reliability.

2. Flink Deployment

Apache Flink underwent a major architectural refactoring in version 1.7, introducing a Master-Worker architecture that allows Flink to better adapt to different cluster infrastructures, including Standalone, Hadoop YARN, and Kubernetes. The following describes the Master-Worker architecture introduced in Flink 1.7 and how it adapts to different cluster infrastructures.

Master-Worker architecture:

The Master-Worker architecture introduced in Flink 1.7 was designed to address problems in earlier versions, such as resource management and high availability. In this architecture, Flink separates task management from resource management and introduces two main roles: JobManager and ResourceManager.

  • JobManager: responsible for accepting and scheduling jobs, maintaining job state and metadata, and handling the fault-tolerance mechanism. There are two kinds of JobManager: JobManager (high-availability mode) and StandaloneJobManager (non-high-availability mode).

  • ResourceManager: responsible for managing the resources in the cluster, including allocating resources to tasks and maintaining the resource pool.

The advantage of this architecture is that it decouples task management from resource management, allowing Flink to better adapt to different cluster environments and infrastructures.

Compatibility:

Flink's Master-Worker architecture design makes it compatible with almost all mainstream information system infrastructure, including:

  • Standalone cluster : In Standalone mode, both Flink's JobManager and ResourceManager run in the same process, which is suitable for simple development and testing scenarios.

  • Hadoop YARN cluster : Flink can be deployed on an existing Hadoop YARN cluster, and interact with YARN ResourceManager through ResourceManager to realize resource management.

  • Kubernetes cluster : Flink also supports deployment in Kubernetes clusters, and manages tasks and resources through the resource management capabilities provided by Kubernetes.

This compatibility enables Flink to run flexibly in different cluster environments to meet the needs of different scenarios.

In short, the Master-Worker architecture introduced by Flink in version 1.7 makes it perform better in terms of resource management and high availability, and also enables Flink to better adapt to various cluster infrastructures, including Standalone, Hadoop YARN and Kubernetes, etc. This brings more flexibility and options to the deployment and use of Flink.

Standalone cluster is a simple deployment mode in Apache Flink, suitable for development, testing and small-scale application scenarios. Below I will introduce the characteristics and deployment methods of the Standalone cluster in detail.

Features of Standalone cluster:

  1. Simple deployment : Standalone cluster is one of the simplest deployment modes of Flink. It does not need to rely on other cluster management tools and can be deployed on a single machine.

  2. Resource sharing : The JobManager and TaskManager in the Standalone cluster share the same resources, such as memory and CPU. This makes resource management relatively simple, but can also affect the performance of tasks when resources are contended for.

  3. Applicable to development and testing : Standalone clusters are suitable for development and testing phases, and can simulate the Flink cluster environment on a local machine, which is convenient for developers to debug and test.

  4. Does not support high availability : Standalone clusters do not support high availability by default, that is, they do not have the ability to recover from failures and migrate tasks. If high availability is required, it can be achieved by running multiple JobManager instances.

Deployment method of Standalone cluster:

  1. Install Flink : First, you need to download and install Flink. You can download precompiled binaries from the official website and unzip them to a specified directory. It can also be downloaded from:

    Apache Flink installation package downloads, Open Source Mirror Station, Alibaba Cloud (aliyun.com)

  2. Configure Flink : Enter the Flink installation directory and modify the conf/flink-conf.yaml configuration file. The main configuration items include jobmanager.rpc.address, taskmanager.numberOfTaskSlots, and so on.

  3. Start JobManager : Open the terminal, enter the Flink installation directory, and execute the following command to start JobManager:

    ./bin/start-cluster.sh
    
  4. Start TaskManager : Open the terminal, enter the Flink installation directory, and execute the following command to start TaskManager:

    ./bin/taskmanager.sh start
    
  5. Submit a job : Submit a job using the Flink client tool. Jobs in a JAR file can be submitted using the following command:

    ./bin/flink run -c your.main.Class ./path/to/your.jar
    
  6. Stop the cluster : You can stop the entire Standalone cluster with the following command:

    ./bin/stop-cluster.sh
    

In short, the Standalone cluster is a simple and easy-to-deploy Flink cluster mode, suitable for development, testing and small-scale application scenarios. However, due to its characteristics of resource sharing and not supporting high availability, it is not suitable for deployment in a production environment.

The following shows how to deploy a simple Flink standalone cluster using Docker.

Deploying a simple Flink cluster with Docker

Flink programs can run as a distributed system within a cluster, and can also be deployed in standalone mode or under YARN, Mesos, Docker-based environments, and other resource management frameworks.

1. Create a flink directory on the server

mkdir flink

The structure of the directory is as follows:

2. Create the docker-compose.yml script

The Docker container orchestration file is as follows:
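The original orchestration file is not reproduced here. The following is a minimal sketch of what such a docker-compose.yml could look like, assuming the official flink image with one JobManager and one TaskManager and the default web port 8081; adjust the image tag, hostnames, and ports to your own environment.

version: "2.1"
services:
  jobmanager:
    image: flink:1.7.2
    hostname: jobmanager
    ports:
      - "8081:8081"                      # Flink web dashboard
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager

  taskmanager:
    image: flink:1.7.2
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager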

3. Start flink

(1) Running in the background

It is generally recommended to use this option in a production environment.

docker-compose up -d

(2) Running in the foreground

The console will print the output information of all containers at the same time, which is very convenient for debugging.

docker-compose up

4. View the dashboard in the browser

Access the web interface

http://cdh1:8081/

3. Quick start with Flink

​ Through a case of word statistics, quickly get started with the application of Flink for streaming (Streaming) and batch processing (Batch)

Practical exercise 1: word count case (batch data)

1.1 Requirements

Count the number of occurrences of each word in a file, and write the statistics to an output file.

Steps:
1. Read the data source
2. Process the data source

a. Split each line of the source file by spaces

b. Pair each split word with the number 1, i.e. (word, 1)

c. Group by word (put identical words together)

d. Sum the counts for each word (accumulate the 1s that follow each word)

3. Save the processing results

1.2 Code implementation

  • Introduce dependencies
<!-- Flink core package -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.7.2</version>
</dependency>
<!-- Flink streaming package -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.7.2</version>
    <scope>provided</scope>
</dependency>	
  • Java program
package com.crazymaker.bigdata.wordcount.batch;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.FlatMapOperator;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * 1. Read the data source
 * 2. Process the data source
 *    a. Split each line of the source file by spaces
 *    b. Pair each word with the number 1
 *    c. Group by word (put identical words together)
 *    d. Sum the counts for each word (accumulate the 1s)
 * 3. Save the processing result
 */
public class WordCountJavaBatch {

    public static void main(String[] args) throws Exception {

        String inputPath = "D:\\data\\input\\hello.txt";
        String outputPath = "D:\\data\\output\\hello.txt";

        // Get the Flink execution environment
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> text = executionEnvironment.readTextFile(inputPath);
        FlatMapOperator<String, Tuple2<String, Integer>> wordAndOnes = text.flatMap(new SplitClz());

        // 0 refers to the first field of the tuple (the word)
        UnsortedGrouping<Tuple2<String, Integer>> groupedWordAndOne = wordAndOnes.groupBy(0);
        // 1 refers to the second field of the tuple (the count)
        AggregateOperator<Tuple2<String, Integer>> out = groupedWordAndOne.sum(1);

        out.writeAsCsv(outputPath, "\n", " ").setParallelism(1); // set the parallelism
        executionEnvironment.execute(); // explicitly trigger execution

    }

    static class SplitClz implements FlatMapFunction<String, Tuple2<String, Integer>> {

        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {

            String[] words = s.split(" ");
            for (String word : words) {
                collector.collect(new Tuple2<String, Integer>(word, 1)); // emit to the downstream operator
            }

        }
    }
}

The content of the source file:

The statistical results:
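The screenshots of the source file and of the result file are not reproduced here. As a purely hypothetical example, if hello.txt contained the two lines

hello flink
hello spark

then the file written by writeAsCsv (one "word count" pair per line, separated by a space; the order of the lines may vary) would look like:

flink 1
hello 2
spark 1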

Practical exercise 2: word count case (stream data)

nc (netcat)

When developing Flink programs, a socket is often used as the source. In a Linux/macOS environment, you can run nc -l 9000 in a terminal (this starts the netcat program, which acts as a server and sends data).

nc is short for netcat, known as the "Swiss army knife" of networking: small but powerful, practical, and designed to be a simple and reliable network tool.

What nc can do:

  • data transmission
  • file transfer
  • network speed measurement between machines

2.1 Requirements

A socket sends words in real time.

Flink receives the data in real time, aggregates and counts the words within a specified time window (e.g. 5 s), recomputing every 1 s, and prints the results for each time window.

2.2 Code implementation

package com.crazymaker.bigdata.wordcount.stream;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * A socket sends words in real time; Flink receives the data in real time.
 */
public class WordCountStream {

    public static void main(String[] args) throws Exception {

        // The IP and port to listen on can be passed as main() arguments:
        // by convention the first argument is the IP and the second is the port.
//        String ip = args[0];
        String ip = "127.0.0.1";
//        int port = Integer.parseInt(args[1]);
        int port = 9000;
        // Get the Flink stream execution environment
        StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Read input data from the socket
        DataStreamSource<String> textStream = streamExecutionEnvironment.socketTextStream(ip, port, "\n");

        SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = textStream.flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {

            public void flatMap(String s, Collector<Tuple2<String, Long>> collector) throws Exception {

                String[] splits = s.split("\\s");
                for (String word : splits) {
                    collector.collect(Tuple2.of(word, 1L));
                }
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Long>> word = tuple2SingleOutputStreamOperator.keyBy(0)
                .sum(1);
        // Print the data
        word.print();
        // Trigger job execution
        streamExecutionEnvironment.execute("wordcount stream process");

    }
}
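The program above produces a running total for each word. The requirement in 2.1 also mentions a 5 s time window evaluated every 1 s, and the unused Time import suggests a windowed variant. A minimal sketch of that variant (Flink 1.7 DataStream API) replaces the keyBy/sum statement with a sliding-window aggregation:

        // Count words over a sliding window of 5 seconds, evaluated every 1 second
        SingleOutputStreamOperator<Tuple2<String, Long>> word = tuple2SingleOutputStreamOperator
                .keyBy(0)
                .timeWindow(Time.seconds(5), Time.seconds(1))
                .sum(1);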

Summary of the process of Flink program development

The process of Flink program development is summarized as follows:

1) Obtain an execution environment

2) Load/create initialization data

3) Specify the operator of the data operation

4) Specify the storage location of the result data

5) Call execute() to trigger the execution program

Note: Flink programs are lazily evaluated; only when the execute() method is called at the end is the actual execution of the program triggered.
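These five steps map directly onto code. Below is a minimal skeleton, a sketch using the DataStream API in which the socket source, the toUpperCase transformation, and the print sink are only placeholders:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkSkeleton {
    public static void main(String[] args) throws Exception {
        // 1) Obtain an execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2) Load / create the initial data (a socket source is used here as a placeholder)
        DataStream<String> lines = env.socketTextStream("127.0.0.1", 9000);

        // 3) Specify the operators that transform the data
        DataStream<String> upper = lines.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        });

        // 4) Specify where to put the result data (printed to stdout here)
        upper.print();

        // 5) Call execute() to trigger the program
        env.execute("skeleton job");
    }
}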

4. Flink distributed architecture and core components

Flink job submission process

The job submission process in standalone mode is as follows:

Before a job is submitted, processes such as Master and TaskManager need to be started.

We can execute scripts in the Flink main directory to start these processes:

bin/start-cluster.sh

After the Master and TaskManager are started, the TaskManager needs to register itself with the ResourceManager in the Master.

This initialization and resource registration process occurs before a single job is submitted, which we call step 0.

Next, we will analyze the submission process of Flink jobs step by step. The specific steps are as follows:

① The user writes the application code and submits the job using the Flink client (Client). Usually, these programs are written in Java or Scala, and call the Flink API to build a logical view. These codes and related configuration files are compiled and packaged, and then submitted to the Dispatcher of the Master node to form an application job (Application).

② After the Dispatcher receives the submitted job, it will start a JobManager, which is responsible for coordinating the tasks of the job.

③ JobManager applies to ResourceManager for the required job resources, which may include CPU, memory, etc.

④ Since in the previous steps, TaskManager has registered available resources with ResourceManager, the TaskManager that is idle at this time will be assigned to JobManager.

⑤ JobManager converts the logical view in the user's job into a physical execution graph, as shown in Figure 3-3, which shows the execution process of the job after it is parallelized. JobManager assigns and deploys computing tasks to multiple TaskManagers. At this point, a Flink job officially starts executing.

During the execution of computing tasks, TaskManager may exchange data with other TaskManagers, using a specific data exchange strategy. At the same time, the TaskManager will also pass the status information of the task to the JobManager, which includes the startup, execution and termination status of the task, as well as the metadata of the snapshot, etc.

Flink core components

Based on this job submission process, we can introduce the functions and roles of the various components involved in more detail:

  1. Client (client) : Users usually use the client tools provided by Flink (such as command-line tools located in the bin directory under the Flink main directory) to submit jobs. The client will preprocess the Flink job submitted by the user and submit the job to the Flink cluster. When submitting a job, the client needs to configure some necessary parameters, such as whether to use a Standalone cluster or a YARN cluster. The entire job will be packaged into a JAR file, and the DataStream API will be converted into a JobGraph, which is similar to the logical view (as shown in Figure 3-2).

  2. Dispatcher (scheduler) : The Dispatcher can receive multiple jobs; each time a job is received, it assigns a JobManager to that job. The Dispatcher provides its external services over the Hypertext Transfer Protocol (HTTP) through a Representational State Transfer (REST) interface.

  3. JobManager (job manager) : The JobManager is the coordinator of individual Flink jobs. Each job has a corresponding JobManager responsible for management. JobManager converts the JobGraph submitted by the client into ExecutionGraph, which is similar to the parallel physical execution graph (as shown in Figure 3-3). JobManager will apply to ResourceManager for the required resources. Once enough resources are obtained, the JobManager will distribute the ExecutionGraph and its computing tasks to multiple TaskManagers. In addition, JobManager also manages multiple TaskManagers, including collecting job status information, generating checkpoints, and performing failure recovery when necessary.

  4. ResourceManager (Resource Manager) : Flink can be deployed in environments such as Standalone, YARN, and Kubernetes, and different environments have different management modes for computing resources. In order to solve the problem of resource allocation, Flink introduces the ResourceManager module. In Flink, the basic unit of computing resources is the task slot (Slot) on the TaskManager. The main responsibility of ResourceManager is to obtain computing resources from resource providers (such as YARN). When the JobManager needs computing resources, the ResourceManager will allocate idle Slots to the JobManager. After the computing task ends, ResourceManager will recycle these idle slots.

  5. TaskManager (task manager) : TaskManager is the node that actually executes computing tasks. Generally speaking, a Flink job will be distributed and executed on multiple TaskManagers, and each TaskManager provides a certain number of Slots. When a TaskManager is started, the relevant Slot information will be registered in the ResourceManager. When the Flink job is submitted, the ResourceManager will assign the idle Slot to the JobManager. Once the JobManager acquires idle slots, it will deploy specific computing tasks to these slots and execute them on these slots. During execution, TaskManager may need to exchange data with other TaskManagers, so necessary data communication is required. In short, TaskManager is responsible for the execution of specific computing tasks, and it will register Slot resources with ResourceManager at startup.

Flink component stack

  1. Deployment layer :

    • Local mode : Flink supports local mode, including single node (SingleNode) and single virtual machine (SingleJVM) modes. In SingleNode mode, JobManager and TaskManager run on the same node; in SingleJVM mode, all roles run in the same JVM.
    • Cluster mode : Flink can be deployed on Standalone, YARN, Mesos and Kubernetes clusters. The Standalone cluster needs to configure the nodes of JobManager and TaskManager, and then start it through the script provided by Flink. YARN, Mesos and Kubernetes clusters provide more powerful resource management and cluster expansion capabilities.
    • Cloud mode : Flink can also be deployed on major cloud platforms, such as AWS, Google Cloud and Alibaba Cloud, enabling users to flexibly deploy and run jobs in cloud environments.
  2. Runtime layer :

    • The runtime layer is the core component of Flink, supporting distributed execution and processing. This layer is responsible for converting the jobs submitted by users into tasks and distributing them to the corresponding JobManager and TaskManager for execution. The runtime layer also covers checkpointing and failure recovery mechanisms to ensure fault tolerance and stability of jobs.
  3. API layer :

    • Flink's API layer provides DataStream API and DataSet API for stream processing and batch processing respectively. These two APIs allow developers to use various operators and transformations to process data, including computational tasks such as transformations, joins, aggregations, and windows.
  4. Upper-layer tools :

    • On top of the API layer, Flink provides some tools to extend its functionality:
      • Complex Event Processing (CEP) : A stream processing-oriented library for detecting and processing complex event patterns.
      • Graph Computing Library (Gelly) : A batch-oriented graph computing library for executing graph algorithms.
      • Table API and SQL : An interface for SQL users and relational data processing scenarios, allowing the use of SQL syntax and table operations to process stream and batch data.
      • PyFlink : An interface for Python users, enabling them to use Flink for data processing, currently mainly based on the Table API.

To sum up, Flink provides a rich set of components and tools at different levels, supports both stream and batch processing, and integrates seamlessly with different environments (local, cluster, and cloud), enabling developers to flexibly build and deploy large-scale data processing applications.

Job execution stages

In Apache Flink, the execution process of a data flow job can be divided into multiple stages, from logical view to physical execution graph conversion. This process includes from StreamGraph to JobGraph, then to ExecutionGraph, and finally mapped to the actual physical execution graph. The process is detailed below:

  1. StreamGraph (logical view) : A StreamGraph is a logical representation of a user-written stream processing application. It includes the transformation operation of data flow, the relationship between operators, event time processing strategy, fault tolerance configuration, etc. StreamGraph is a user-defined data flow topology, which is a high-level abstraction. Users can build StreamGraph through DataStream API.

  2. JobGraph (job graph) : JobGraph is derived from StreamGraph and represents a specific job execution plan. In JobGraph, logical operators in StreamGraph are mapped to specific physical operators, and there is a clear execution sequence and dependencies between tasks. JobGraph also contains information such as resource configuration, task parallelism, and optimization options. JobGraph is a key step from logical view to physical execution.

  3. ExecutionGraph (execution graph) : ExecutionGraph is the execution time representation of JobGraph, which is the core of the actual execution plan. In ExecutionGraph, each task in JobGraph is mapped to a specific execution task, and each task can contain one or more subtasks, which are mapped to different TaskManagers. ExecutionGraph is also responsible for maintaining the execution status of jobs, as well as scheduling and communication between tasks.

  4. Physical execution graph : ExecutionGraph is mapped to the actual physical execution graph, that is, the task topology actually executed on the TaskManager cluster. The physical execution diagram includes details such as parallel execution of tasks, data exchange, and task status management. It is the embodiment of the actual operation of jobs in a distributed environment.

To sum up, the conversion from StreamGraph to JobGraph to ExecutionGraph is a key step in Flink job execution plan. The conversion process from the logical view to the physical execution graph takes into account the topological structure of the job, resource allocation, task scheduling, etc., ensuring that the job can be executed efficiently in a distributed environment. This series of conversion processes allows users to describe job logic through high-level abstraction, and the Flink framework will be responsible for converting it into an executable task graph to realize data flow processing and calculation.

5. Flink development

The Flink application structure mainly includes three parts, Source/Transformation/Sink, as shown in the following figure:

code program structure

Source: the data source. Flink has roughly four types of sources for stream processing and batch processing:

  • sources based on local collections
  • file-based sources
  • socket-based sources
  • custom sources. Common custom sources include Apache Kafka, Amazon Kinesis Streams, RabbitMQ, Twitter Streaming API, Apache NiFi, etc. Of course, you can also define your own source.

Transformation: the various data-transformation operations, including Map / FlatMap / Filter / KeyBy / Reduce / Fold / Aggregations / Window / WindowAll / Union / Window join / Split / Select / Project, etc. There are many such operations, and by combining these transformations you can compute the data you need.
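To make this concrete, here is a small sketch (DataStream API, with fromElements as a toy source; the class name and sample data are made up for the example) that chains a filter, a map, a keyBy, and a sum:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformationChainDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> words = env.fromElements("spark", "flink", "flink", "hadoop");

        DataStream<Tuple2<String, Integer>> counts = words
                // Filter: drop empty strings
                .filter(new FilterFunction<String>() {
                    @Override
                    public boolean filter(String value) {
                        return !value.isEmpty();
                    }
                })
                // Map: turn each word into a (word, 1) pair
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) {
                        return Tuple2.of(value, 1);
                    }
                })
                // KeyBy + sum: group by the word and accumulate the counts
                .keyBy(0)
                .sum(1);

        counts.print();
        env.execute("transformation chain demo");
    }
}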

Sink: the receiver, i.e. where Flink sends the transformed and computed data, typically for storage. Flink's common sinks are roughly as follows:

  • write to file
  • print to standard output
  • write to socket
  • custom sinks. Common custom sinks include Apache Kafka, RabbitMQ, MySQL, Elasticsearch, Apache Cassandra, Hadoop FileSystem, etc. Similarly, you can also define your own sink.

Build the development environment

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.lagou</groupId>
    <artifactId>flinkdemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- Flink core package -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.7.2</version>
        </dependency>
        <!-- Flink streaming package -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.7.2</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.1.5</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.12</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.12</artifactId>
            <version>1.7.2</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.73</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_2.12</artifactId>
            <version>1.7.2</version>
        </dependency>

    </dependencies>
    <build>
        <plugins>
            <!-- plugin for building the shaded (fat) jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>

</project>

Flink-connector

In the actual production environment, data is usually distributed in various systems, including file systems, databases, message queues, etc. As a big data processing framework, Flink needs to interact with these external systems to realize data input, processing and output. In Flink, Source and Sink are two key modules, which play an important role in data connection and interaction with external systems, and are collectively referred to as external connectors (Connector).

  1. Source (data source) : Source is the input module of Flink job, which is used to read data from external systems and convert it into Flink data stream. Source is responsible for implementing the interaction logic with different data sources, and reading data from external data sources into Flink's data stream one by one or in batches for subsequent data processing. Common sources include reading data from files, consuming data from message queues (such as Kafka, RabbitMQ), reading data from databases, etc.

  2. Sink (data receiver) : Sink is the output module of Flink job, which is used to output the result of Flink calculation to the external system. Sink is responsible for writing the data in the Flink data stream to an external data source for subsequent persistent storage, display or other processing. The implementation of Sink needs to consider data reliability, consistency, and possible transactional requirements. Common sinks include writing data to files, writing data to databases, and writing data to message queues, etc.

The role of external connectors in Flink is very critical. They enable Flink jobs to interact with various types of data sources and data destinations, and realize the inflow and outflow of data. This flexible connection mechanism enables Flink to better integrate existing systems and data when processing big data, and realize complex data flow processing and analysis tasks.

Source

There are two main types of sources commonly used by Flink in batch processing.

  • Based on local collection source (Collection-based-source)
  • File-based source (File-based-source)
Source based on local collection

There are three common ways to create a DataSet from a local collection in Flink:

  1. Use env.fromElements(), which also supports composite types such as Tuple and custom objects.
  2. Use env.fromCollection(), which supports many concrete Collection types.
  3. Use env.generateSequence(), which creates a DataSet from a numeric sequence.

The way to use it is as follows:

package com.demo.broad;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Stack;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class BatchFromCollection {

    public static void main(String[] args) throws Exception {

        // Get the Flink execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Create a DataSet from individual elements (fromElements)
        DataSet<String> ds0 = env.fromElements("spark", "flink");
        ds0.print();

        // 2. Create a DataSet from Tuples (fromElements)
        DataSet<Tuple2<Integer, String>> ds1 = env.fromElements(
            new Tuple2<>(1, "spark"),
            new Tuple2<>(2, "flink")
        );
        ds1.print();

        // 3. Create a DataSet from an ArrayList (any java.util.List works the same way)
        ArrayList<String> list = new ArrayList<>();
        list.add("spark");
        list.add("flink");
        DataSet<String> ds2 = env.fromCollection(list);
        ds2.print();

        // 4. Create a DataSet from an ArrayDeque
        ArrayDeque<String> deque = new ArrayDeque<>();
        deque.add("spark");
        deque.add("flink");
        DataSet<String> ds3 = env.fromCollection(deque);
        ds3.print();

        // 5. Create a DataSet from a Stack
        Stack<String> stack = new Stack<>();
        stack.add("spark");
        stack.add("flink");
        DataSet<String> ds4 = env.fromCollection(stack);
        ds4.print();

        // 6. Create a DataSet from a HashSet
        HashSet<String> set = new HashSet<>();
        set.add("spark");
        set.add("flink");
        DataSet<String> ds5 = env.fromCollection(set);
        ds5.print();

        // 7. Create a DataSet from a Java Stream
        //    (collected to a List first, since fromCollection expects a Collection)
        DataSet<String> ds6 = env.fromCollection(Stream.of("spark", "flink").collect(Collectors.toList()));
        ds6.print();

        // 8. Create a DataSet from the entries of a HashMap (converted to Tuple2)
        HashMap<Integer, String> map = new HashMap<>();
        map.put(1, "spark");
        map.put(2, "flink");
        ArrayList<Tuple2<Integer, String>> entries = new ArrayList<>();
        map.forEach((k, v) -> entries.add(Tuple2.of(k, v)));
        DataSet<Tuple2<Integer, String>> ds7 = env.fromCollection(entries);
        ds7.print();

        // 9. Create a DataSet from a range of integers
        DataSet<Integer> ds8 = env.fromCollection(IntStream.rangeClosed(1, 8).boxed().collect(Collectors.toList()));
        ds8.print();

        // 10. Create a DataSet with generateSequence
        DataSet<Long> ds9 = env.generateSequence(1, 9);
        ds9.print();
    }
}
File-based Source

Flink supports reading files directly from an external file storage system to create a Source data source. Flink supports the following methods:

  1. read local file data
  2. Read HDFS file data
  3. Read CSV file data
  4. read compressed files
  5. traverse directory

The following describes how to load each data source:

Read local file data

package com.demo.batch;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchFromFile {

    public static void main(String[] args) throws Exception {

        // Initialize the execution environment
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();

        // Use readTextFile to load a local file
        DataSet<String> datas = environment.readTextFile("data.txt");

        // print() triggers execution and outputs the data
        datas.print();
    }
}
Read HDFS file data
package com.demo.batch;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchFromHdfsFile {

    public static void main(String[] args) throws Exception {

        // Initialize the execution environment
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();

        // Use readTextFile to load a file from HDFS
        DataSet<String> datas = environment.readTextFile("hdfs://node01:8020/README.txt");

        // print() triggers execution and outputs the data
        datas.print();
    }
}
Read CSV file data
package com.demo.batch;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchFromCsvFile {

    /** POJO class used to map the rows of the CSV file */
    public static class Student {

        public int id;
        public String name;

        public Student() {
        }

        public Student(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Student(" + id + ", " + name + ")";
        }
    }

    public static void main(String[] args) throws Exception {

        // Initialize the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the CSV file, skip the header line, and map each row to a Student POJO
        DataSet<Student> csvDataSet = env.readCsvFile("./data/input/student.csv")
            .ignoreFirstLine()
            .pojoType(Student.class, "id", "name");

        csvDataSet.print();
    }
}
read compressed files

For the following compression types, there is no need to specify any additional inputformat methods, and flink can automatically identify and decompress them. However, compressed files may not be read in parallel, but may be read sequentially, which may affect the scalability of the job.

| Compression format | Extension  | Parallelizable |
| ------------------ | ---------- | -------------- |
| DEFLATE            | .deflate   | no             |
| GZIP               | .gz, .gzip | no             |
| bzip2              | .bz2       | no             |
| XZ                 | .xz        | no             |
package com.demo.batch;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.DataSet;

public class BatchFromCompressFile {
    public static void main(String[] args) throws Exception {
        // 初始化环境
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 加载数据
        DataSet<String> result = env.readTextFile("D:\\BaiduNetdiskDownload\\hbase-1.3.1-bin.tar.gz");

        // 触发程序执行
        result.print();
    }
}
Traverse a directory

Flink supports traversal access to all files in a file directory, including all files in all subdirectories.

When reading from a directory, files in nested subdirectories are not read by default; only the files in the top-level directory are read and the rest are ignored. To read a directory recursively, the recursive.file.enumeration parameter needs to be enabled:

package com.demo.batch;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.configuration.Configuration;

public class BatchFromFolder {
    public static void main(String[] args) throws Exception {
        // 初始化环境
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 开启递归枚举,读取所有子目录中的文件
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // 加载数据:读取目录及其子目录下的所有文件
        DataSet<String> result = env.readTextFile("./data/input/")
                .withParameters(parameters);

        // 触发程序执行
        result.print();
    }
}
read kafka
import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class StreamFromKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "teacher2:9092");

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("mytopic2", new SimpleStringSchema(), properties);
        DataStreamSource<String> data = env.addSource(consumer);

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne =
                data.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                        for (String word : s.split(" ")) {
                            collector.collect(Tuple2.of(word, 1));
                        }
                    }
                });

        SingleOutputStreamOperator<Tuple2<String, Integer>> result = wordAndOne.keyBy(0).sum(1);
        result.print();
        env.execute();
    }
}

custom source
private static class SimpleSource
        implements SourceFunction<Tuple2<String, Integer>> {

    private int offset = 0;
    private boolean isRunning = true;

    @Override
    public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
        while (isRunning) {
            Thread.sleep(500);
            ctx.collect(new Tuple2<>("" + offset, offset));
            offset++;
            if (offset == 1000) {
                isRunning = false;
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}

The custom Source above counts up from 0 and emits each number downstream. Call this Source in the main logic:

DataStream<Tuple2<String, Integer>> countStream = env.addSource(new SimpleSource()); 
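
For completeness, here is a minimal sketch of how the custom source might be wired into a full streaming job. The job class name and the assumption that SimpleSource is declared as a nested class of it are illustrative, not from the original text:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleSourceJob {

    // SimpleSource is assumed to be the SourceFunction defined above,
    // declared here as a nested class so addSource can reference it.

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Register the user-defined SourceFunction as the data source
        DataStream<Tuple2<String, Integer>> countStream = env.addSource(new SimpleSource());

        countStream.print();
        env.execute("Simple Source Job");
    }
}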

Sink

Flink's common sinks in batch processing:

  • Sink based on a local collection (Collection-based sink)
  • Sink based on a file (File-based sink)

Sinks based on local collections

Target:

Using the following data, print it to standard output, print it to standard error, and collect it with collect():

(19, "zhangsan", 178.8),
(17, "lisi", 168.8),
(18, "wangwu", 184.8),
(21, "zhaoliu", 164.8)

code:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

import java.util.ArrayList;
import java.util.List;

public class BatchSinkCollection {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Tuple3<Integer, String, Double>> stuData = new ArrayList<>();
        stuData.add(new Tuple3<>(19, "zhangsan", 178.8));
        stuData.add(new Tuple3<>(17, "lisi", 168.8));
        stuData.add(new Tuple3<>(18, "wangwu", 184.8));
        stuData.add(new Tuple3<>(21, "zhaoliu", 164.8));

        DataSet<Tuple3<Integer, String, Double>> stu = env.fromCollection(stuData);

        // Print to standard output and to standard error
        stu.print();
        stu.printToErr();

        // collect() pulls the elements into a local collection
        stu.collect().forEach(System.out::println);

        // print()/collect() already trigger execution, so no extra env.execute() is needed here
    }
}
file-based sink
  • Flink supports files on various storage devices, including local files, hdfs files, etc.
  • Flink supports a variety of file storage formats, including text files, CSV files, etc.
  • writeAsText(): uses TextOutputFormat to write each element as a string on its own line; the string is obtained by calling the element's toString() method.
write data to local file

Target:

Based on the following data, write to the file

Map(1 -> "spark", 2 -> "flink")

code:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem;

import java.util.HashMap;
import java.util.Map;

public class BatchSinkFile {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Map<Integer, String> data1 = new HashMap<>();
        data1.put(1, "spark");
        data1.put(2, "flink");

        DataSet<Map<Integer, String>> ds1 = env.fromElements(data1);

        // Write with parallelism 1 so a single output file is produced
        ds1.writeAsText("test/data1/aa", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

        env.execute();
    }
}
write data to HDFS
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem;

import java.util.HashMap;
import java.util.Map;

public class BatchSinkFile {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Map<Integer, String> data1 = new HashMap<>();
        data1.put(1, "spark");
        data1.put(2, "flink");

        DataSet<Map<Integer, String>> ds1 = env.fromElements(data1);

        // Write with parallelism 1 so a single output file is produced
        ds1.writeAsText("hdfs://bigdata1:9000/a", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

        env.execute();
    }
}

Flink API

Flink's API layer provides DataStream API and DataSet API for stream processing and batch processing respectively. These two APIs allow developers to use various operators and transformations to process data, including computational tasks such as transformations, joins, aggregations, and windows.

In Flink, according to different scenarios (stream processing or batch processing), different execution environments need to be set. In the batch processing scenario, you need to use the DataSet API and set up the batch execution environment. In the stream processing scenario, you need to use the DataStream API and set up the stream processing execution environment.

The following are sample codes for setting up the execution environment in different scenarios, respectively showing batch processing and stream processing, including Scala and Java languages.

Batch Scenario - Set up the batch execution environment for the DataSet API (Java) :

import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJobExample {
    public static void main(String[] args) throws Exception {
        // 创建批处理执行环境
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 在这里添加批处理作业的代码逻辑
        // ...

        // 执行作业
        env.execute("Batch Job Example");
    }
}

Stream processing scenario - Set up the stream processing execution environment (Java) of the DataStream API :

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamJobExample {
    public static void main(String[] args) throws Exception {
        // 创建流处理执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 在这里添加流处理作业的代码逻辑
        // ...

        // 执行作业
        env.execute("Stream Job Example");
    }
}

Batch Scenario - Set up the batch execution environment for the DataSet API (Scala) :

import org.apache.flink.api.scala._

object BatchJobExample {
  def main(args: Array[String]): Unit = {
    // 创建批处理执行环境
    val env = ExecutionEnvironment.getExecutionEnvironment

    // 在这里添加批处理作业的代码逻辑
    // ...

    // 执行作业
    env.execute("Batch Job Example")
  }
}

Stream processing scenario - Set up the stream processing execution environment (Scala) of the DataStream API :

import org.apache.flink.streaming.api.scala._

object StreamJobExample {
  def main(args: Array[String]): Unit = {
    // 创建流处理执行环境
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 在这里添加流处理作业的代码逻辑
    // ...

    // 执行作业
    env.execute("Stream Job Example")
  }
}

According to the above sample code, you can choose the appropriate execution environment and API to build and execute Flink jobs in different scenarios. Note that when importing packages, make sure to use the correct package and class names for batch or streaming environments.

The following are some commonly used API functions and operations, provided in tabular form:

| API type | Common functions and operations | Description |
| --- | --- | --- |
| DataStream API | map, flatMap | Map or flatten each element in the data stream. |
| | filter | Filter out the elements that meet the condition. |
| | keyBy | Partition a data stream by a specified field or key. |
| | window | Divide the data stream into time windows or count windows. |
| | reduce, fold | Aggregate the elements within a window. |
| | union | Combine multiple data streams. |
| | connect, coMap, coFlatMap | Connect two data streams of different types and apply the corresponding functions. |
| | timeWindow, countWindow | Define a time window or a count window. |
| | process | Custom processing functions for more complex stream processing logic. |
| DataSet API | map, flatMap | Map or flatten each element in the dataset. |
| | filter | Filter out the elements that meet the condition. |
| | groupBy | Group a dataset by a specified field or key. |
| | reduce, fold | Aggregate the grouped datasets. |
| | join, coGroup | Perform inner or outer join operations on two datasets. |
| | cross, cartesian | Perform a Cartesian product of two datasets. |
| | distinct | Remove duplicate elements from a dataset. |
| | groupBy, aggregate | Group a dataset and aggregate each group. |
| | first, min, max | Get the first, smallest or largest element in the dataset. |
| | sum, avg | Compute the sum or average of the elements in a dataset. |
| | collect | Collect the elements of the dataset into a local collection. |

These API functions and operations cover the common operations of stream processing and batch processing in Flink, and can help users implement various complex data processing and analysis tasks. According to actual needs, you can choose appropriate API functions and operations to build Flink jobs.

Here are some descriptions of the APIs that you can refer to:

map


Convert each element in the DataSet to another element

example

Using the map operation, convert the following data

"1,张三", "2,李四", "3,王五", "4,赵六"

into a User POJO class.

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Create a User POJO class
  4. Perform the transformation with the map operation
  5. Print to test

Reference Code

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;

public class User {

    public String id;
    public String name;

    public User() {
    }

    public User(String id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public String toString() {
        return "User(" + id + ", " + name + ")";
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> textDataSet = env.fromCollection(
            Arrays.asList("1,张三", "2,李四", "3,王五", "4,赵六")
        );

        DataSet<User> userDataSet = textDataSet.map(new MapFunction<String, User>() {
            @Override
            public User map(String text) throws Exception {
                String[] fieldArr = text.split(",");
                return new User(fieldArr[0], fieldArr[1]);
            }
        });

        userDataSet.print();
    }
}

flatMap


Convert each element in the DataSet to 0...n elements

example

Transform each of the following records into three records at increasing granularity: (name, country), (name, country, province) and (name, country, province, city).

the following data

张三,中国,江西省,南昌市
李四,中国,河北省,石家庄市
Tom,America,NewYork,Manhattan

converted to

(张三,中国)
(张三,中国,江西省)
(张三,中国,江西省,南昌市)
(李四,中国)
(李四,中国,河北省)
(李四,中国,河北省,石家庄市)
(Tom,America)
(Tom,America,NewYork)
(Tom,America,NewYork,Manhattan)

train of thought

  • Each input record is converted into three output records, so this is clearly a job for flatMap

  • Inside the flatMap function, build the three records for each input and emit them:

    name, country
    name, country, province
    name, country, province, city

step

  1. Build the batch runtime environment
  2. Build a data source from a local collection
  3. Use flatMap to convert one record into three records
    • Split the fields by comma
    • Build the country, country + province, and country + province + city records respectively
  4. Print the output

Reference Code

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.ArrayList;

public class UserProcessing {

    public static class User {
        public String name;
        public String country;
        public String province = "";
        public String city = "";

        public User() {
        }

        public User(String name, String country, String province, String city) {
            this.name = name;
            this.country = country;
            this.province = province;
            this.city = city;
        }

        @Override
        public String toString() {
            // Print only the fields that are filled, e.g. (张三,中国) or (张三,中国,江西省,南昌市)
            StringBuilder sb = new StringBuilder("(").append(name).append(",").append(country);
            if (!province.isEmpty()) {
                sb.append(",").append(province);
            }
            if (!city.isEmpty()) {
                sb.append(",").append(city);
            }
            return sb.append(")").toString();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> userDataSet = env.fromCollection(new ArrayList<String>() {{
            add("张三,中国,江西省,南昌市");
            add("李四,中国,河北省,石家庄市");
            add("Tom,America,NewYork,Manhattan");
        }});

        DataSet<User> resultDataSet = userDataSet.flatMap(new FlatMapFunction<String, User>() {
            @Override
            public void flatMap(String text, Collector<User> collector) throws Exception {
                String[] fieldArr = text.split(",");
                String name = fieldArr[0];
                String country = fieldArr[1];
                String province = fieldArr[2];
                String city = fieldArr[3];

                // One input record produces three output records with increasing granularity
                collector.collect(new User(name, country, "", ""));
                collector.collect(new User(name, country, province, ""));
                collector.collect(new User(name, country, province, city));
            }
        });

        resultDataSet.print();
    }
}

mapPartition


Transforms the elements of one partition into other elements.

example

Using the mapPartition operation, convert the following data

"1,张三", "2,李四", "3,王五", "4,赵六"

into a User POJO class.

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Create a User POJO class
  4. Perform the transformation with the mapPartition operation
  5. Print to test

Reference Code

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Iterator;

public class MapPartitionExample {

    public static class User {
        public String id;
        public String name;

        public User() {
        }

        public User(String id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "User(" + id + ", " + name + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> userDataSet = env.fromCollection(new ArrayList<String>() {{
            add("1,张三");
            add("2,李四");
            add("3,王五");
            add("4,赵六");
        }});

        DataSet<User> resultDataSet = userDataSet.mapPartition(new MapPartitionFunction<String, User>() {
            @Override
            public void mapPartition(Iterable<String> iterable, Collector<User> collector) throws Exception {
                // TODO: 打开连接

                Iterator<String> iterator = iterable.iterator();
                while (iterator.hasNext()) {
                    String ele = iterator.next();
                    String[] fieldArr = ele.split(",");
                    collector.collect(new User(fieldArr[0], fieldArr[1]));
                }

                // TODO: 关闭连接
            }
        });

        resultDataSet.print();
    }
}

map and mapPartition produce the same result. However, if the function needs to access external storage, for example a MySQL database, map has to open a connection for every element, which is inefficient. With mapPartition the connection is opened only once per partition, which greatly reduces the number of connections and improves efficiency, as the sketch below shows.
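
To make the point concrete, here is a minimal sketch of filling in the TODO markers of the reference code above with one JDBC connection per partition. The JDBC URL and credentials are illustrative assumptions, not part of the original example:

// Sketch only: per-partition JDBC access (the URL and credentials are assumptions).
// Additional imports needed: java.sql.Connection, java.sql.DriverManager,
// org.apache.flink.api.common.functions.MapPartitionFunction, org.apache.flink.util.Collector.
DataSet<User> enrichedDataSet = userDataSet.mapPartition(new MapPartitionFunction<String, User>() {
    @Override
    public void mapPartition(Iterable<String> records, Collector<User> out) throws Exception {
        // Open the connection once per partition, not once per record
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "user", "password")) {
            for (String record : records) {
                String[] fieldArr = record.split(",");
                // A real job could query conn here to enrich the record
                out.collect(new User(fieldArr[0], fieldArr[1]));
            }
        }   // The connection is closed once, after the whole partition has been processed
    }
});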

filter


Filters out the elements that meet a condition.

example:

Filter out the words below that start with the letter h.

"hadoop", "hive", "spark", "flink"

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Perform the filtering with the filter operation
  4. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.ArrayList;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> wordDataSet = env.fromCollection(new ArrayList<String>() {{
            add("hadoop");
            add("hive");
            add("spark");
            add("flink");
        }});

        DataSet<String> resultDataSet = wordDataSet.filter(word -> word.startsWith("h"));

        resultDataSet.print();
    }
}

reduce


Performs an aggregation over a DataSet or over a group, finally aggregating it into a single element.

Example 1

Aggregate the following tuple data into one final result using the reduce operation

("java" , 1) , ("java", 1) ,("java" , 1) 

Convert the above tuple data into ("java", 3).

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Perform the aggregation with the reduce operation
  4. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

public class ReduceExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> wordCountDataSet = env.fromCollection(new ArrayList<Tuple2<String, Integer>>() {{
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 1));
        }});

        DataSet<Tuple2<String, Integer>> resultDataSet = wordCountDataSet.reduce((wc1, wc2) ->
            new Tuple2<>(wc2.f0, wc1.f1 + wc2.f1)
        );

        resultDataSet.print();
    }
}

Example 2

Group the following tuple data by word using groupBy, then aggregate each group into a final result using the reduce operation

("java" , 1) , ("java", 1) ,("scala" , 1)  

converted to

("java", 2), ("scala", 1)

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Group by word using groupBy
  4. Aggregate each group with reduce
  5. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

public class GroupByReduceExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> wordCountDataSet = env.fromCollection(new ArrayList<Tuple2<String, Integer>>() {{
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("scala", 1));
        }});

        DataSet<Tuple2<String, Integer>> groupedDataSet = wordCountDataSet.groupBy(0).reduce((wc1, wc2) ->
            new Tuple2<>(wc1.f0, wc1.f1 + wc2.f1)
        );

        groupedDataSet.print();
    }
}

reduceGroup


Aggregation calculations can be performed on a dataset or a group, and finally aggregated into one element

The difference between reduce and reduceGroup

  • reduce pulls the data element by element to another node and then performs the calculation there
  • reduceGroup first performs the calculation on the node where each group's data is located, and only then pulls the (much smaller) partial results

example

Group the following tuple data by word using groupBy, then count the words with the reduceGroup operation

("java" , 1) , ("java", 1) ,("scala" , 1)  

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Group by word using groupBy
  4. Aggregate each group with reduceGroup
  5. Print to test

Reference Code

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

import java.util.ArrayList;

public class GroupByReduceGroupExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataSet<Tuple2<String, Integer>> wordCountDataSet = env.fromCollection(new ArrayList<Tuple2<String, Integer>>() {{
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("scala", 1));
        }});

        DataSet<Tuple2<String, Integer>> groupedDataSet = wordCountDataSet
            .groupBy(0)
            .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public void reduce(Iterable<Tuple2<String, Integer>> iter, Collector<Tuple2<String, Integer>> collector) {
                    String word = null;
                    int count = 0;
                    for (Tuple2<String, Integer> wc : iter) {
                        word = wc.f0;
                        count += wc.f1;
                    }
                    collector.collect(new Tuple2<>(word, count));
                }
            });

        groupedDataSet.print();
    }
}

aggregate


Aggregates using built-in aggregation functions such as SUM / MIN / MAX. aggregate can only be applied to tuple (元组) DataSets.

example

Use aggregate to perform a word count on the following tuple data

("java" , 1) , ("java", 1) ,("scala" , 1)

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Group by word using groupBy
  4. Aggregate each group with aggregate and SUM
  5. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

public class GroupByAggregateExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> wordCountDataSet = env.fromCollection(new ArrayList<Tuple2<String, Integer>>() {{
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("scala", 1));
        }});

        // groupBy returns a grouping, and aggregate(SUM, 1) sums field 1 per group
        DataSet<Tuple2<String, Integer>> resultDataSet = wordCountDataSet
            .groupBy(0)
            .aggregate(Aggregations.SUM, 1);

        resultDataSet.print();
    }
}

Notice

When using aggregate, the grouping can only be specified with a field position index or a field name, such as groupBy(0); grouping with a KeySelector function is not supported, otherwise an error will be reported:

Exception in thread "main" java.lang.UnsupportedOperationException: Aggregate does not support grouping with KeySelector functions, yet.
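
As a small sketch reusing the wordCountDataSet from the reference code above (the KeySelector import is org.apache.flink.api.java.functions.KeySelector), the index-based form works while the KeySelector form fails with the exception above:

// Works: group by field position, then aggregate field 1
wordCountDataSet.groupBy(0)
        .aggregate(Aggregations.SUM, 1)
        .print();

// Fails at runtime with the UnsupportedOperationException shown above
wordCountDataSet.groupBy(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> value) {
                return value.f0;
            }
        })
        .aggregate(Aggregations.SUM, 1)
        .print();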

distinct


Removes duplicate elements from the data.

example

Use distinct to remove the duplicate words from the following tuple data

("java" , 1) , ("java", 2) ,("scala" , 1)

and obtain the deduplicated result

("java", 1), ("scala", 1)

step

  1. Get the ExecutionEnvironment runtime environment
  2. Build the data source with fromCollection
  3. Use distinct and specify the field to deduplicate on
  4. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

public class DistinctExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> wordCountDataSet = env.fromCollection(new ArrayList<Tuple2<String, Integer>>() {{
            add(new Tuple2<>("java", 1));
            add(new Tuple2<>("java", 2));
            add(new Tuple2<>("scala", 1));
        }});

        // Deduplicate on the first field (the word)
        DataSet<Tuple2<String, Integer>> resultDataSet = wordCountDataSet.distinct(0);

        resultDataSet.print();
    }
}

join


Use join to connect two DataSets

Example:

There are two csv files, score.csv and subject.csv, which store the score data and the subject data respectively.

csv sample

These two data sets need to be joined and then printed out.

join results

step

  1. Copy the two files into data/join/input in the project

  2. Build a batch environment

  3. Create two example classes

    * Subject (subject ID, subject name)
    * Score (unique ID, student name, subject ID, score as a Double)

  4. Load each csv data source with readCsvFile and specify its POJO type

  5. Use join to connect the two DataSets, and set the join condition with the where and equalTo methods

  6. Print the joined data source

Reference Code

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple5;

public class JoinExample {

    public static class Score {
        public int id;
        public String name;
        public int subjectId;
        public double score;

        public Score() {
        }

        public Score(int id, String name, int subjectId, double score) {
            this.id = id;
            this.name = name;
            this.subjectId = subjectId;
            this.score = score;
        }

        @Override
        public String toString() {
            return "Score(" + id + ", " + name + ", " + subjectId + ", " + score + ")";
        }
    }

    public static class Subject {
        public int id;
        public String name;

        public Subject() {
        }

        public Subject(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Subject(" + id + ", " + name + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Score> scoreDataSet = env.readCsvFile("./data/join/input/score.csv")
            .ignoreFirstLine()
            .pojoType(Score.class, "id", "name", "subjectId", "score");

        DataSet<Subject> subjectDataSet = env.readCsvFile("./data/join/input/subject.csv")
            .ignoreFirstLine()
            .pojoType(Subject.class, "id", "name");

        // Join condition: score.subjectId == subject.id,
        // output (score id, student name, subject id, score, subject name)
        DataSet<Tuple5<Integer, String, Integer, Double, String>> joinedDataSet = scoreDataSet.join(subjectDataSet)
            .where("subjectId")
            .equalTo("id")
            .with(new JoinFunction<Score, Subject, Tuple5<Integer, String, Integer, Double, String>>() {
                @Override
                public Tuple5<Integer, String, Integer, Double, String> join(Score score, Subject subject) {
                    return new Tuple5<>(score.id, score.name, score.subjectId, score.score, subject.name);
                }
            });

        joinedDataSet.print();
    }
}

union


The union of two DataSets will not deduplicate.

example

Perform union operation on the following data

Dataset 1

"hadoop", "hive", "flume"

Dataset 2

"hadoop", "hive", "spark"

step

  1. Build a batch execution environment
  2. Create two data sources with fromCollection
  3. Connect the two data sources with union
  4. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.ArrayList;

public class UnionExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> wordDataSet1 = env.fromCollection(new ArrayList<String>() {{
            add("hadoop");
            add("hive");
            add("flume");
        }});

        DataSet<String> wordDataSet2 = env.fromCollection(new ArrayList<String>() {{
            add("hadoop");
            add("hive");
            add("spark");
        }});

        DataSet<String> resultDataSet = wordDataSet1.union(wordDataSet2);

        resultDataSet.print();
    }
}

connect

connect() provides a function similar to union(), that is, connecting two data streams. The difference between it and union() is as follows.

  • connect() can only connect two data streams, and union() can connect multiple data streams.

  • The data types of the two data streams connected by connect() may be inconsistent, and the data types of two or more data streams connected by union() must be consistent.

  • Two DataStreams are transformed into ConnectedStreams after connect(). ConnectedStreams will apply different processing methods to the data of the two streams, and the state can be shared between the two streams.

DataStream<Integer> intStream  = senv.fromElements(2, 1, 5, 3, 4, 7); 
DataStream<String> stringStream  = senv.fromElements("A", "B", "C", "D"); 
 
ConnectedStreams<Integer, String> connectedStream =  
intStream.connect(stringStream); 
DataStream<String> mapResult = connectedStream.map(new MyCoMapFunction()); 

// CoMapFunction的3个泛型分别对应第一个流的输入类型、第二个流的输入类型,输出类型 
public static class MyCoMapFunction implements CoMapFunction<Integer, String, String> {

    @Override
    public String map1(Integer input1) {
        return input1.toString();
    }

    @Override
    public String map2(String input2) {
        return input2;
    }
}

rebalance


Flink can also suffer from data skew (数据倾斜). For example, when the data volume is around 1 billion records, the following situation may occur during processing:

data skew

rebalance distributes the data evenly in a round-robin fashion and is the best choice for dealing with data skew.

rebalance

step

  1. Build a batch execution environment

  2. Create parallel data from 0 to 100 with env.generateSequence

  3. Use filter to keep the numbers greater than 8

  4. Use a map operation with a RichMapFunction to build a tuple of (current subtask index, number)

    Inside a RichMapFunction the subtask index can be obtained with `getRuntimeContext().getIndexOfThisSubtask()`

  5. Print to test

Reference Code

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class MapWithSubtaskIndexExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Long> numDataSet = env.generateSequence(0, 100);

        DataSet<Long> filterDataSet = numDataSet.filter(num -> num > 8);

        // Build (subtask index, number) tuples so the distribution across subtasks is visible
        DataSet<Tuple2<Integer, Long>> resultDataSet = filterDataSet.map(new RichMapFunction<Long, Tuple2<Integer, Long>>() {
            @Override
            public Tuple2<Integer, Long> map(Long in) throws Exception {
                return new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), in);
            }
        });

        resultDataSet.print();
    }
}

The code above does not call rebalance; by observing the printed subtask indexes you may see data skew.

Calling rebalance right after the filter spreads the data evenly across all partitions, as in the sketch below.
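
A minimal sketch of that change, based on the reference code above (same imports; numDataSet is the generateSequence source defined there):

// Sketch: insert rebalance() between filter and map so that the remaining
// elements are redistributed round-robin across all subtasks
DataSet<Long> rebalancedDataSet = numDataSet
        .filter(num -> num > 8)
        .rebalance();

DataSet<Tuple2<Integer, Long>> balancedResult = rebalancedDataSet.map(new RichMapFunction<Long, Tuple2<Integer, Long>>() {
    @Override
    public Tuple2<Integer, Long> map(Long in) throws Exception {
        return new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), in);
    }
});

balancedResult.print();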

hashPartition


Hash partition according to the specified key

example

Create a data source based on the following list data, partition it according to hashPartition, and output it to a file.

List(1,1,1,1,1,1,1,2,2,2,2,2)

step

  1. Build a batch execution environment
  2. Set the parallelism to 2
  3. Build a test dataset with fromCollection
  4. Partition by hash with partitionByHash, using the string form of the element as the key
  5. Call writeAsText to write the files into the data/partition_output directory
  6. Print to test

Reference Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.core.fs.FileSystem;

import java.util.ArrayList;

public class PartitionByHashExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Set parallelism to 2
        env.setParallelism(2);

        DataSet<Integer> numDataSet = env.fromCollection(new ArrayList<Integer>() {{
            add(1); add(1); add(1); add(1); add(1); add(1); add(1);
            add(2); add(2); add(2); add(2); add(2);
        }});

        // Hash-partition by the element itself (converted to a string key)
        DataSet<Integer> partitionDataSet = numDataSet.partitionByHash(new KeySelector<Integer, String>() {
            @Override
            public String getKey(Integer num) {
                return num.toString();
            }
        });

        partitionDataSet.writeAsText("./data/partition_output", FileSystem.WriteMode.OVERWRITE);

        // print() triggers execution (including the file sink above), so env.execute() is not called again
        partitionDataSet.print();
    }
}

sortPartition


Sorts the data within each partition by the specified field.

example

Follow the list below to create a dataset

List("hadoop", "hadoop", "hadoop", "hive", "hive", "spark", "spark", "flink")

After sorting the partitions, output to a file.

step

  1. Build a batch execution environment
  2. Build a test dataset with fromCollection
  3. Set the parallelism of the dataset to 2
  4. Sort each partition in descending order with sortPartition
  5. Call writeAsText to write the files into the data/sort_output directory
  6. Start execution

Reference Code

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.core.fs.FileSystem;

import java.util.ArrayList;

public class SortPartitionExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The parallelism of the data source is set to 2
        DataSet<String> wordDataSet = env.fromCollection(new ArrayList<String>() {{
            add("hadoop"); add("hadoop"); add("hadoop");
            add("hive"); add("hive");
            add("spark"); add("spark");
            add("flink");
        }}).setParallelism(2);

        // Sort the data inside each partition in descending order
        DataSet<String> sortedDataSet = wordDataSet.sortPartition(new KeySelector<String, String>() {
            @Override
            public String getKey(String word) {
                return word;
            }
        }, Order.DESCENDING);

        sortedDataSet.writeAsText("./data/sort_output/", FileSystem.WriteMode.OVERWRITE);

        env.execute("App");
    }
}

window

In many cases, we need to solve such a problem: For a specific time period, such as an hour, we need to perform statistics and analysis on the data. However, to implement this kind of data window operation, it is first necessary to determine which data should enter this window. Before diving into the definition of window operations, we must first determine which temporal semantics the job will use.

In other words, time window is a key concept in data processing, which is used to divide data into specific time periods for calculation. However, before deciding how to define these windows, we must choose the appropriate temporal semantics, be it event time, processing time, or ingestion time. Different time semantics have different meanings and uses in data processing, so before choosing a time window, we need to clarify the time semantics required by the job in order to define and process the data window correctly.
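
As a small illustration (a sketch only, not taken from the omitted sections below: the input stream, the field positions and the 5-second size are assumptions), a keyed tumbling processing-time window in the DataStream API looks roughly like this:

// Assumes a DataStream<Tuple2<String, Integer>> named wordAndOne, as in the Kafka example earlier.
// Additional imports: org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows,
// org.apache.flink.streaming.api.windowing.time.Time.
DataStream<Tuple2<String, Integer>> windowedCounts = wordAndOne
        .keyBy(t -> t.f0)                                               // key the stream by word
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))      // 5-second tumbling window
        .sum(1);                                                        // aggregate the counts per window

windowedCounts.print();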

time concept

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

windows program

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

rolling window

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

sliding window

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

session window

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

Quantity based window

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

trigger

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

clearer

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

6. Local execution and cluster execution of Flink programs

6.1. Local Execution

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

6.2. Cluster Execution

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

7. Flink broadcast variables

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

8. Accumulators in Flink

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

9. Flink's distributed cache

Omitted here due to character limit

For the complete content, please refer to "Big Data Flink Study Bible", pdf, find Nin for free

Knowledge about the Table API & SQL, as well as state and checkpoints, will be added in later iterations. Stay tuned!

say later

This article is the V1 version of "Big Data Flink Study Bible", which is a companion article of "Nin's Big Data Interview Collection".

Here is a special explanation: since the first release of the 5 topic PDFs of "Nin's Big Data Interview Collection", hundreds of questions have been collected, a large amount of them real, practical questions from interviews at major companies. The collection of interview questions in "Nin's Big Data Interview Collection" has become a must-read book for big data learning and interviews.

Therefore, the Nien architecture team struck while the iron was hot and launched the "Big Data Flink Study Bible" .

For the complete pdf, you can follow Nien's official account [Technical Freedom Circle] to receive it.

In addition, "Big Data Flink Study Bible" and "Nin's Big Data Interview Collection" will continue to be iterated and updated to absorb the latest interview questions. For the latest version, please refer to the official account [Technical Freedom Circle].

about the author

First author: Andy, a senior architect and one of the authors of "Java High Concurrency Core Programming Enhanced Edition".

Second author: Nien, a 41-year-old senior architect, senior writer and well-known blogger in the IT field. He is the creator of "Java High Concurrency Core Programming Enhanced Edition" Volumes 1, 2 and 3, and the author of 11 PDF Bibles including the "K8S Study Bible", "Docker Study Bible" and "Go Study Bible". He is also a senior architecture instructor and architecture-transformation coach who has successfully guided many intermediate and senior Java engineers into architect positions; the highest annual salary among his students is close to 1 million.

reference

recommended reading

" Nin's Big Data Interview Collection Topic 1: The Most Complete Hadoop Interview Questions in History "

" Nin's Big Data Interview Collection Topic 2: Top Secret 100 Spark Interview Questions, Memorized 100 Times, Get a High Salary "

" Nin's Big Data Interview Collection Topic 3: The Most Complete Hive Interview Questions in History, Continuously Iterating and Continuously Upgrading "

" Nin's Big Data Interview Collection Topic 4: The Most Complete Flink Interview Questions in History, Constantly Iterating and Continuously Upgrading "

" Nin's Big Data Interview Collection Topic 5: The Most Complete HBase Interview Questions in History, Continuously Iterating and Continuously Upgrading "

"Nin's Architecture Notes", "Nin's High Concurrency Trilogy", "Nin's Java Interview Collection" PDF, please go to the following official account [Technical Freedom Circle] to take it↓↓↓

Origin blog.csdn.net/crazymakercircle/article/details/132347440