Flink from entry to real fragrance (1: run the first demo in streaming mode and batch mode)

Basic concepts: the difference between batch processing and stream processing

Batch processing has a long history in the big data world; Spark is a typical example. Batch processing operates on large, static data sets and returns results only after the computation has finished.

Data sets used in batch mode usually have the following characteristics:

(1) Bounded: a batch data set represents a finite collection of data

(2) Persistent: the data is usually stored in some kind of persistent storage

(3) Large: batch processing is often the only practical way to process extremely large data sets

Batch processing is ideal for computations that require access to the complete set of records. For example, to compute totals and averages, the data set must be treated as a whole rather than as a collection of individual records. These operations require the data to hold its state for the duration of the computation.
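To make this concrete, here is a minimal sketch of such a whole-data-set computation, using the Flink DataSet API introduced below: an average over a made-up bounded set of numbers, which can only be produced once every record has been seen.

import org.apache.flink.api.scala._

object BatchAverage {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // A tiny bounded data set, standing in for records from persistent storage
    val numbers: DataSet[Int] = env.fromElements(3, 5, 7, 9)
    val average = numbers
      .map(n => (n, 1)) // pair every value with a count of 1
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2)) // (sum, count) over the whole set
      .map(t => t._1.toDouble / t._2) // sum / count
    average.print() // prints 6.0
  }
}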

Tasks that process large volumes of data are usually best handled by batch operations. Whether the data set is processed directly from persistent storage or first loaded into memory, batch systems are designed with large data volumes in mind and provision resources accordingly. Because batch processing excels at handling large amounts of persistent data, it is frequently used to analyze historical data. Processing that much data takes considerable time, however, so batch processing is not suited to scenarios with tight latency requirements.

Stream processing systems, by contrast, compute over data as it enters the system. This is a fundamentally different approach from batch mode: instead of operating on an entire data set, a streaming job operates on each data item as it passes through the system.

Data sets in stream processing are unbounded, which has several important consequences:

(1) The complete data set can only represent the total amount of data that has entered the system so far.

(2) The working data set is perhaps more relevant: it represents only a single data item at a given moment.

(3) Processing is event-driven and has no "end" unless explicitly stopped. Results are available immediately and are continuously updated as new data arrives.

A stream processing system can handle a practically unlimited amount of data, but at any moment it processes only one record (true stream processing) or a small batch of records (micro-batch processing), and only minimal state is maintained between records. Although most systems provide ways to maintain some state, stream processing is primarily optimized for more functional processing with fewer side effects.

Functional operations focus on discrete steps with limited state or side effects: performing the same operation on the same data produces the same result, independent of other factors. This kind of processing fits streams well, because keeping state across items is typically some combination of difficult, limited, and, in some cases, undesirable. So although certain kinds of state management are usually possible, these frameworks are simpler and more efficient without a state-management mechanism.

This makes stream processing a natural fit for certain workloads. Tasks with near-real-time requirements, such as analytics, server or application error logs, and other time-based metrics, are prime candidates, because reacting to data changes in these areas is critical for business functions. Stream processing suits data where you must respond to changes or spikes, and where the trend over a period of time matters.

Goal

Read a txt file and use Flink's streaming mode and batch mode to compute word-count statistics.
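For example, suppose hello.txt (a made-up sample reused in the rest of this post) contains:

hello world
hello flink
hello scala

Both jobs below count how often each word occurs.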

Getting started: environment preparation

Use IDEA to create a new Maven project and add two dependencies in pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <version>1.10.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.12</artifactId>
        <version>1.10.1</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- This plugin compiles the Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.4.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <!-- Used for packaging the fat jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.3.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
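With the assembly plugin configured as above, packaging is a plain Maven build. The jar name below assumes the project's artifactId is flink1 and its version is 1.0-SNAPSHOT, matching the jar used later in this post:

mvn clean package
# produces target/flink1-1.0-SNAPSHOT-jar-with-dependencies.jar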

Create a new scala directory alongside the java directory and mark it as a Sources Root.
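Assuming the project is named flink1, the layout should end up roughly like this (hello.txt is the sample input file placed under resources):

flink1
├── pom.xml
└── src
    └── main
        ├── java
        ├── resources
        │   └── hello.txt
        └── scala
            └── com
                └── mafei
                    └── wc
                        └── WordCount.scala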

Batch mode: WordCount

Create a new package com.mafei.wc, and inside it a new Scala object named WordCount to run the first simple demo: read data from the file, then split it into words, group, and sum.

package com.mafei.wc

import org.apache.flink.api.scala.ExecutionEnvironment

// Bring in the implicit conversions defined for the Scala API
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a batch execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    // Read data from a file
    val inputPath: String = "/opt/java2020_study/maven/flink1/src/main/resources/hello.txt"
    val inputDataSet: DataSet[String] = env.readTextFile(inputPath)

    // Transform and aggregate: split into words, group by word, then sum the counts
    val resultDataSet: DataSet[(String, Int)] = inputDataSet
      .flatMap(_.split(" ")) // split on spaces
      .map((_, 1))
      .groupBy(0) // group by the first element (the word) as the key
      .sum(1) // sum the second element of every grouped record

    // Print the result
    resultDataSet.print()
  }

}
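With the sample hello.txt from above, the job prints something like the following; the exact order of the lines is not guaranteed:

(hello,3)
(world,1)
(flink,1)
(scala,1)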

Code structure and run output:



Stream processing example: monitor a socket port, receive text in real time, and compute results continuously

1. Create a new StreamWordCount stream-processing class

package com.mafei.wc

import org.apache.flink.streaming.api.scala._

object StreamWordCount {
  def main(args: Array[String]): Unit = {

    // Create a stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Receive a socket text stream
    val inputDataStream = env.socketTextStream("127.0.0.1", 7777)

    // Transform and aggregate
    val resultDataStreams = inputDataStream
      .flatMap(_.split(" ")) // split on spaces
      .filter(_.nonEmpty) // keep only non-empty tokens
      .map((_, 1)) // attach a count of 1 to every word
      .keyBy(0) // key by the first element (the word)
      .sum(1) // running sum of the counts

    resultDataStreams.print()
    // Trigger execution
    env.execute("stream word count")

  }

}

2. Open a new terminal and listen on port 7777 of the local machine:

nc -lk 7777

3. Start the code; the program now waits for data on that port.
4. Switch to the nc terminal opened in step 2, type some characters, and press Enter; you can watch the job compute results in real time, as in the hypothetical session below.
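For example, typing this into the nc terminal:

hello flink
hello world

produces output like the following in the job's console. The N> prefix is the parallel subtask that emitted the record (the numbers will vary), and because sum keeps a running total per key, an updated count is emitted for every incoming word:

3> (hello,1)
1> (flink,1)
2> (world,1)
3> (hello,2)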


The examples above all run Flink jobs locally. Next, run a new test class on a Flink server: it reads data from a socket and does the same grouped computation. The difference is that the socket host is no longer hard-coded; it is passed in as a parameter when the Flink job is launched.

package com.mafei.wc

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

object StreamWordCount2 {
  def main(args: Array[String]): Unit = {

    // Create a stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    env.setParallelism(10)  // set the parallelism for the whole job
    //env.disableOperatorChaining()  // disable operator chaining

    // Extract parameters from the command line, to use as the socket hostname (and port)
    val paramTool: ParameterTool = ParameterTool.fromArgs(args)
    val host: String = paramTool.get("host")
//    val port: Int = paramTool.getInt("port")

    // Receive a socket text stream
    val inputDataStream = env.socketTextStream(host, 7777)
//    val inputDataStream = env.socketTextStream("127.0.0.1",7777)

    // Transform and aggregate
    val resultDataStreams = inputDataStream
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
//      .map((_, 1)).setParallelism(3) // set the parallelism for a single operator
      .map((_, 1)).setParallelism(2) // set the parallelism for a single operator
      .keyBy(0)
      .sum(1)

    resultDataStreams.print().setParallelism(1) // parallelism can also be set on the sink, useful e.g. when writing output to a file
    env.execute("stream word count")

  }

}
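Note that this version still hard-codes port 7777; --port is read only by the commented-out line. ParameterTool also supports default values, so a small hedged variation could read both parameters with fallbacks, replacing the corresponding lines in main:

    // Read host and port from the arguments, with defaults if they are missing
    val host: String = paramTool.get("host", "127.0.0.1")
    val port: Int = paramTool.getInt("port", 7777)

    // Receive a socket text stream on the configured host and port
    val inputDataStream = env.socketTextStream(host, port)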

Package the code into a jar, then open Flink's web UI (port 8081 by default). In the Submit New Job section, upload the packaged jar, click it in the jar list,
enter com.mafei.wc.StreamWordCount2 in the Entry Class field,
and enter in Program Arguments: --host 127.0.0.1 --port 7777
Final effect:


Run the nc program on the server:
install nc: yum install nc -y
listen on port 7777: nc -lk 7777

The running job as shown in the Flink web UI:

Look at Flink's output (remember to type some data in the socket terminal first):

tail -f /opt/flink-1.10.2/log/flink*
(sd,1)
(fg,1)
(sdfg,1)
(s,1)
(dfg,1)
(sdf,1)
(g,1)
(wert,1)
(wert,2)
(xdfcg,1)

Command-line execution mode:
Manually upload the packaged jar to the server, then run:

/opt/flink-1.10.2/bin/flink run -c com.mafei.wc.StreamWordCount2 -p 1 /opt/flink1-1.0-SNAPSHOT-jar-with-dependencies.jar --host localhost --port 7777
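The command above blocks the terminal while the job runs; flink run also accepts -d (detached mode) to submit the job and return immediately, which suits long-running streaming jobs:

/opt/flink-1.10.2/bin/flink run -d -c com.mafei.wc.StreamWordCount2 -p 1 /opt/flink1-1.0-SNAPSHOT-jar-with-dependencies.jar --host localhost --port 7777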

List all tasks:
/opt/flink-1.10.2/bin/flink list

List all tasks, including completed ones:
/opt/flink-1.10.2/bin/flink list -a

Cancel a task:
/opt/flink-1.10.2/bin/flink cancel 43dcc61e27b64e63306c9e9ab1b8e0f9
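Since Flink 1.9 there is also a stop command, which takes a savepoint before shutting the job down. A hedged sketch (the -p flag sets the savepoint target directory; the exact options may differ across versions):

/opt/flink-1.10.2/bin/flink stop -p /tmp/flink-savepoints 43dcc61e27b64e63306c9e9ab1b8e0f9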

Origin: blog.51cto.com/mapengfei/2546985