Preliminary exploration of stream processing and batch processing with Flink in Java

During the Dragon Boat Festival holiday, with the summer heat holding above 40 degrees, I stayed home to study Flink and am recording my notes here for future reference.
Development tool: IntelliJ IDEA
Flink version: 1.13.0
This article uses Flink to implement simple batch processing (DataSet API) and stream processing (DataStream API) examples.

The first step, create a project and add dependencies

1) Create a new project
Open IDEA, create a new Maven project, name the package and project, and click OK to enter the project.
2) Introduce dependencies
In the pom.xml file, add the dependencies, i.e. flink-java, flink-streaming-java, flink-clients, slf4j, etc.; you can refer to the following code.

<properties>
    <flink.version>1.13.0</flink.version>
    <java.version>1.8</java.version>
    <scala.binary.version>2.12</scala.binary.version>
    <slf4j.version>1.7.2</slf4j.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <!-- logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-to-slf4j</artifactId>
        <version>2.16.0</version>
    </dependency>
</dependencies>

3) Add a log configuration file
In the resources directory, create a log4j.properties file with the content shown below.

log4j.rootLogger=error,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
The second step, construct the data set

Under the project root, create a new input folder to store the data sets, and inside it create a words.txt file as the test data set.
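For illustration, a minimal words.txt might look like this (the sample content is assumed, not taken from the original post):

hello world
hello flink
hello java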

The third step, write business code

Read the contents of the data set and count the occurrences of each word. Create a new BatchWordCount class; reading, processing, and printing the data set breaks down into six steps.
Method 1. Batch processing
The main processing steps of the batch processing DataSet API are:
1) Create an execution environment;
2) Read data from the environment;
3) Split each line into words and flat-map them into two-tuples of (word, 1);
4) Group by word;
5) Aggregate within each group;
6) Print the results.
The batch processing DataSet API code is as follows.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.FlatMapOperator;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {

    public static void main(String[] args) throws Exception {

        // 1. Create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // 2. Read data from the environment
        DataSource<String> lineDataSource = env.readTextFile("input/words.txt");
        // 3. Split each line into words and flat-map each word into a (word, 1) two-tuple
        FlatMapOperator<String, Tuple2<String, Long>> wordAndOneTuple = lineDataSource.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
            // Split the line on spaces
            String[] words = line.split(" ");
            // Emit each word as a two-tuple
            for (String word : words) {
                out.collect(Tuple2.of(word, 1L));
            }
        }).returns(Types.TUPLE(Types.STRING, Types.LONG));
        // 4. Group by word (tuple field 0)
        UnsortedGrouping<Tuple2<String, Long>> wordAndOneGroup = wordAndOneTuple.groupBy(0);
        // 5. Aggregate within each group by summing field 1
        AggregateOperator<Tuple2<String, Long>> sum = wordAndOneGroup.sum(1);
        // 6. Print the results
        sum.print();
    }
}

The aggregated word counts are printed to the console.
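Assuming the sample words.txt above, the console output of the batch job would look something like the following (the order of the groups may vary):

(flink,1)
(world,1)
(hello,3)
(java,1)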
Since Flink 1.12, the official recommendation is to use the DataStream API directly; for batch processing, simply switch the execution mode to BATCH when submitting the job:
$ bin/flink run -Dexecution.runtime-mode=BATCH batchWordCount.jar
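The runtime mode can also be set in code on the streaming environment. A minimal sketch using Flink's RuntimeExecutionMode enum (the class name and body here are illustrative, not from the original post):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Execute this DataStream program with batch semantics;
        // STREAMING (the default) and AUTOMATIC are the other options.
        env.setRuntimeExecutionMode(RuntimeExecutionMode.BATCH);
        // ... build the same source/flatMap/keyBy/sum pipeline as below, then:
        env.execute();
    }
}

Note that the Flink documentation recommends passing the mode on the command line rather than hard-coding it, so the same program can run in either mode.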

Method 2. Stream processing
The processing steps of the DataStream API are similar to those of batch processing; the main differences are that the execution environment is a streaming one and that the job must be started explicitly with env.execute().

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BatchStreamWordCount {

    public static void main(String[] args) throws Exception {

        // 1. Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the file
        DataStreamSource<String> lineDataStreamSource = env.readTextFile("input/words.txt");
        // 3. Transform: split each line and flat-map each word into a (word, 1) two-tuple
        SingleOutputStreamOperator<Tuple2<String, Long>> wordAndOneTuple = lineDataStreamSource.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
            // Split the line on spaces
            String[] words = line.split(" ");
            // Emit each word as a two-tuple
            for (String word : words) {
                out.collect(Tuple2.of(word, 1L));
            }
        }).returns(Types.TUPLE(Types.STRING, Types.LONG));
        // 4. Key the stream by word (tuple field f0)
        KeyedStream<Tuple2<String, Long>, Object> wordAndOneKeyedStream = wordAndOneTuple.keyBy(data -> data.f0);
        // 5. Sum field 1 within each key
        SingleOutputStreamOperator<Tuple2<String, Long>> sum = wordAndOneKeyedStream.sum(1);
        // 6. Print the results
        sum.print();
        // 7. Start execution
        env.execute();
    }
}

From the printed results you can see that multi-threaded execution is out of order: the number in the first column of each line is the index of the parallel subtask that produced the record, and the default parallelism of the local environment equals the number of CPU cores.
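For reference, with the sample words.txt above, the streaming job's output would look something like this (the subtask indices and interleaving are illustrative and vary by machine and run; a given word is always routed to the same subtask, so its incremental counts share one prefix):

3> (java,1)
5> (hello,1)
8> (world,1)
5> (hello,2)
7> (flink,1)
5> (hello,3)

If you need a fixed number of subtasks, you can set it explicitly with env.setParallelism(...).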

