Flink Quick Start (6): Broadcast Variables, Distributed Cache, and Accumulators in Flink (super detailed)

The previous article covered Flink's commonly used operators (17 TransFormAction operators in Flink). How can you optimize the code you have written to make it more efficient? In this article we use broadcast variables and the distributed cache to improve code efficiency.

1. Flink's broadcast variables (important)

Introducing Flink broadcast variables and their usage scenarios
Flink supports broadcast variables: a data set is broadcast to every TaskManager and held in memory, which can eliminate a large number of shuffle operations. For example, a join phase inevitably involves heavy shuffling; if we broadcast one of the DataSets and load it into TaskManager memory, the data can be fetched directly from memory, avoiding the shuffles that would otherwise degrade cluster performance. Once created, a broadcast variable can be used in any function on the cluster without being shipped to the nodes repeatedly. Also remember that a broadcast variable should not be modified, so that every node sees a consistent value.
In one sentence: a broadcast variable can be understood as a shared variable. We broadcast a data set, and tasks on any node can then access it; each node holds exactly one copy. Without broadcasting, every task on every node would need its own copy of the data set, wasting memory (that is, a single node might hold the same data several times).
Note: Because a broadcast variable is held entirely in memory, the broadcast data must not be too large, otherwise problems such as OOM will occur.

  • A broadcast variable can be understood as a shared variable
  • After a data set is broadcast, tasks on any node can access it
  • Each node holds only one copy
  • Without broadcasting, every task copies the data set, wasting memory

Usage:
Call withBroadcastSet on the operation that needs the broadcast data to register the broadcast.
Inside the operation, call getRuntimeContext.getBroadcastVariable[broadcast data type](broadcast name) to obtain the broadcast variable.
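A minimal sketch of this register-then-fetch pattern (the whitelist data and the name "whitelist" are illustrative, not part of the example that follows); it also shows that a broadcast variable works in any rich function, here a filter:

import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val whitelist = env.fromElements(1, 2, 3) // the small data set to broadcast
    val data = env.fromElements(1, 2, 3, 4, 5)

    val kept = data
      .filter(new RichFilterFunction[Int] {
        var allowed: java.util.List[Int] = _
        override def open(parameters: Configuration): Unit = {
          // fetch the broadcast data set by its registered name
          allowed = getRuntimeContext.getBroadcastVariable[Int]("whitelist")
        }
        override def filter(value: Int): Boolean = allowed.contains(value)
      })
      .withBroadcastSet(whitelist, "whitelist") // register the broadcast on the operator
    kept.print() // 1, 2, 3
  }
}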

Example:
Create a student data set with the following data

|Student ID | Name|
List((1, "张三"), (2, "李四"), (3, "王五"))

Broadcast this data set.
Create another score data set

|Student ID | Subject | Score|
List( (1, "语文", 50), (2, "数学", 70), (3, "英文", 86))

Convert the data into

List( ("张三", "语文", 50), ("李四", "数学", 70), ("王五", "英文", 86))

Implementation steps

  1. Obtain the batch execution environment
  2. Create the two data sets (student information, score information)
  3. Use a RichMapFunction to map the score data
  4. After calling map on the score data set, call withBroadcastSet to broadcast the student information
  5. Implement the RichMapFunction
    a. Convert the score data (student ID, subject, score) -> (student name, subject, score)
    b. Override the open method to obtain the broadcast data
    c. Import scala.collection.JavaConverters._ for the implicit conversions
    d. Use asScala to convert the broadcast variable to a Scala collection, then toMap to build a Scala Map
    e. Perform the conversion in the map method using the broadcast variable
  6. Print the output

Code reference

import java.util

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

/**
 * Requirement: create a student data set containing
 * List((1, "张三"), (2, "李四"), (3, "王五"))
 * and a score data set:
 * |student ID | subject | score|
 * List( (1, "语文", 50),(2, "数学", 70), (3, "英文", 86))
 * Use a broadcast variable to obtain the student name and convert the data to
 * List( ("张三", "语文", 50),("李四", "数学", 70), ("王五", "英文", 86))
 *
 * @author
 * @date 2020/9/18 23:15
 * @version 1.0
 */
object BatchBroadcast {

  def main(args: Array[String]): Unit = {

    // 1. Build the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    // 2. Build the data sets
    val student = env.fromCollection(List((1, "张三"), (2, "李四"), (3, "王五")))
    val score = env.fromCollection(List((1, "语文", 50), (2, "数学", 70), (3, "英文", 86)))
    // 3. Use a RichMapFunction to transform the score data set
    val result = score.map(new RichMapFunction[(Int, String, Int), (String, String, Int)] {

      // A map that holds the data from the broadcast variable
      var studentMap: Map[Int, String] = _

      override def open(parameters: Configuration): Unit = {
        // Implicit conversions between Java and Scala collections
        import scala.collection.JavaConverters._
        // Fetch the broadcast data by its registered name
        val studentList: util.List[(Int, String)] = getRuntimeContext.getBroadcastVariable[(Int, String)]("student")
        studentMap = studentList.asScala.toMap
      }

      // Override map to return the converted record
      override def map(value: (Int, String, Int)): (String, String, Int) = {
        val stuName = studentMap.getOrElse(value._1, "")
        (stuName, value._2, value._3)
      }
    }).withBroadcastSet(student, "student")
    // 4. Print the result
    result.print()
    /* (张三,语文,50)
       (李四,数学,70)
       (王五,英文,86) */
  }
}
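Note that open() runs once per parallel task instance before any call to map(), so the broadcast list is converted to a Scala Map only once rather than for every record.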

2. Flink's distributed cache (important)

Introducing the distributed cache:
Flink provides a distributed cache similar to Hadoop's, so that functions running in parallel instances can access files locally. It can be used to share external static data, such as machine-learning models (e.g. a logistic regression model). Usage flow: use the ExecutionEnvironment to register a local or remote file (for example, a file on HDFS) under a name, which registers it as a cached file. When the program executes, Flink automatically copies the file or directory to the local file system of every worker node, and a function can then retrieve the file from the node's local file system by that name.
Note: broadcasting distributes a variable into the memory of each worker node, while the distributed cache copies a file to each worker node's local file system.
Usage:
Call registerCachedFile on the Flink execution environment to register a distributed cache file.
Inside the operation, call getRuntimeContext.getDistributedCache.getFile(file name) to obtain the cached file.
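A minimal sketch of this two-step flow (the HDFS path and the name "staticData" are illustrative assumptions, not part of the example below):

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object DistributedCacheSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // 1. Register the file under a name; Flink copies it to every worker node
    env.registerCachedFile("hdfs:///path/to/static_data.txt", "staticData")

    env.fromElements("a", "b")
      .map(new RichMapFunction[String, String] {
        override def open(parameters: Configuration): Unit = {
          // 2. Retrieve the worker-local copy of the file by its registered name
          val file: File = getRuntimeContext.getDistributedCache.getFile("staticData")
          // ... read the file as needed ...
        }
        override def map(value: String): String = value
      })
      .print()
  }
}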

Example:

List( (1, "语文", 50),(2, "Mathematics", 70), (3, "English", 86))

Use the distributed cache to obtain the student names and convert the data into

List( ("张三", "语文", 50), ("李四", "数学", 70), ("王五", "英文", 86))

Note: the student.txt test file stores the student IDs and student names.

Implementation steps:

  1. Create the student.txt file (sample contents shown after this list)
  2. Obtain the batch execution environment
  3. Create the score data set
  4. Perform a map conversion on the score data set, converting (student ID, subject, score) to (student name, subject, score)
    a. Obtain the distributed cache data in the open method of a RichMapFunction
    b. Perform the conversion in the map method
  5. Implement the open method
    a. Use getRuntimeContext.getDistributedCache.getFile to obtain the cached file
    b. Use Source.fromFile to read the file line by line
    c. Convert each line into a tuple (student ID, student name), then collect the tuples into a Map
  6. Implement the map method
    a. Look up the student in the cached data by student ID
    b. Get the student's name
    c. Build the final result tuple
  7. Print and test
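Before running, create the cached file. A sample ./data/student.txt consistent with the comma-separated format the code parses and with the student data from the broadcast example (the exact contents are an assumption):

1,张三
2,李四
3,王五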

Code reference

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.io.Source

/**
 * Requirement: create a score data set
 * List( (1, "语文", 50),(2, "数学", 70), (3, "英文", 86))
 * Use the distributed cache to obtain the student name and convert the data to
 * List( ("张三", "语文", 50),("李四", "数学", 70), ("王五", "英文", 86))
 * Note: the student.txt test file stores student IDs and student names
 *
 * @author
 * @date 2020/9/18 23:51
 * @version 1.0
 */
object BatchDisCachedFile {

  def main(args: Array[String]): Unit = {

    // 1. Build the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    // 2. Build the score data set
    val scoreDataSet = env.fromCollection(List((1, "语文", 50), (2, "数学", 70), (3, "英文", 86)))
    // 3. Register the distributed cache file
    env.registerCachedFile("./data/student.txt", "student")
    val result = scoreDataSet.map(new RichMapFunction[(Int, String, Int), (String, String, Int)] {

      // A map that holds the data read from the distributed cache
      var studentMap: Map[Int, String] = _

      // Initialization: runs once per task before any map() call
      override def open(parameters: Configuration): Unit = {
        // Obtain the cached file by its registered name
        val student: File = getRuntimeContext.getDistributedCache.getFile("student")
        // Read the file line by line
        val lines = Source.fromFile(student).getLines()
        // Convert each "id,name" line into an (id, name) tuple and build a Map
        studentMap = lines.map(line => {
          val arr = line.split(",")
          (arr(0).toInt, arr(1))
        }).toMap
      }

      override def map(value: (Int, String, Int)): (String, String, Int) = {
        val studentName = studentMap.getOrElse(value._1, "")
        (studentName, value._2, value._3)
      }
    })
    result.print()
  }
}
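Because the example registers the relative path ./data/student.txt, it runs as-is in local mode (for example, from the IDE); on a real cluster the registered path would typically point to a shared file system such as HDFS, as mentioned above.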

3. Flink's accumulators (Accumulators, for understanding)

Introduction:
An Accumulator is a counter, similar in its application scenario to MapReduce counters: it lets you observe how the data changes while a task is running. An accumulator can be manipulated inside the operator functions of a Flink job, but its final result is only available after the job has finished. Counter is a concrete Accumulator; the available implementations are IntCounter, LongCounter and DoubleCounter.

Example:

Requirement: given the data source "a", "b", "c", "d", use an accumulator to print how many elements there are.

Implementation steps:

  1. Create an accumulator
  2. Register the accumulator
  3. Use the accumulator
  4. Obtain the accumulator's result

Code reference


import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

/**
 * Requirement: given the data source
 * "a","b","c","d"
 * use an accumulator to print how many elements there are
 *
 * @author
 * @date 2020/9/19 0:17
 * @version 1.0
 */
object BatchCounter {

  def main(args: Array[String]): Unit = {

    // 1. Build the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    // 2. Build the data source
    val dataSet = env.fromElements("a", "b", "c", "d")
    val resultDataSet = dataSet.map(new RichMapFunction[String, String] {

      // 1. Create the accumulator
      val counter: IntCounter = new IntCounter()

      override def open(parameters: Configuration): Unit = {
        // 2. Register the accumulator under a name
        getRuntimeContext.addAccumulator("MyAccumulator", counter)
      }

      override def map(value: String): String = {
        // 3. Use the accumulator: add 1 for each element
        counter.add(1)
        value
      }
    })
    resultDataSet.writeAsText("./data/BatchCounter")
    val result = env.execute("BatchCounter")
    // 4. Obtain the accumulator's result from the job execution result
    val value = result.getAccumulatorResult[Int]("MyAccumulator")
    println("The final result of the accumulator is: " + value)
  }
}
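Since the source contains four elements and map() adds 1 for each, the job prints: The final result of the accumulator is: 4.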
