Flink distributed cache and accumulators

Distributed Cache

Flink provides a distributed cache, similar to Hadoop's, that lets user functions easily read a file locally in parallel. Flink places the file on every TaskManager node so that tasks do not have to pull it repeatedly. The mechanism works as follows: a file or directory (on a local or remote file system such as HDFS or S3) is registered as a cached file through the ExecutionEnvironment, under a name of your choice.

When the program is executed, Flink automatically copies the file or directory to the local file system of every TaskManager node; this happens only once. User functions can then look up the file or directory by the registered name and access it from the TaskManager node's local file system.

In effect, the distributed cache is the counterpart of Spark's broadcast, which broadcasts a variable to all executors. It is also similar to Flink's broadcast stream, except that what is broadcast here is a file.

Examples

Step 1: Register a file. The file can be on HDFS, or a local file for testing.
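A minimal sketch of this step in Scala (the path and the cache name "my-cache-file" are placeholders for illustration, not the ones used in the complete programs below):

val env = ExecutionEnvironment.getExecutionEnvironment
// register a file under a name of your choice; a local "file:///..." path works for testing
env.registerCachedFile("hdfs:///path/to/cache.txt", "my-cache-file")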

Step 2: Use the file.

Note:

Access the cached file or directory inside a user function (here a map function). The function must extend a rich function (for example RichMapFunction), because it needs the RuntimeContext to read the data:
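A minimal sketch of this step in Scala, assuming data is a DataSet[String] like the one in the complete programs below and using the placeholder name "my-cache-file" registered in the previous sketch:

data.map(new RichMapFunction[String, String] {
  override def open(parameters: Configuration): Unit = {
    // look up the cached file by its registered name; Flink has already copied it
    // to the local file system of the TaskManager running this task
    val cachedFile = getRuntimeContext.getDistributedCache.getFile("my-cache-file")
    // ... read cachedFile (a java.io.File) as needed
  }
  override def map(value: String): String = value
})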

The complete distributed cache Java code is as follows:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.configuration.Configuration;

/**
 * <p/>
 * <li>title: DataSet distributed cache</li>
 * <li>@author: li.pan</li>
 * <li>Date: 2019/12/29 16:10</li>
 * <li>Version: V1.0</li>
 * <li>Description: </li>
 */
public class JavaDataSetDistributedCacheApp {

    public static void main(String[] args) throws Exception {

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        String filePath = "file:///Users/lipan/workspace/flink_demo/flink-local-train/src/main/resources/sink/java/cache.txt";

        // step 1: register a local/HDFS file
        env.registerCachedFile(filePath, "lp-java-dc");

        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");

        data.map(new RichMapFunction<String, String>() {

            List<String> list = new ArrayList<String>();

            // step 2: read the content of the distributed cache in the open() method
            @Override
            public void open(Configuration parameters) throws Exception {
                File file = getRuntimeContext().getDistributedCache().getFile("lp-java-dc");
                List<String> lines = FileUtils.readLines(file);
                for (String line : lines) {
                    list.add(line);
                    System.out.println("line = [" + line + "]");
                }
            }

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        }).print();
    }
}

The distributed cache Scala code is as follows:

import org.apache.commons.io.FileUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

/**
  * <p/>
  * <li>title: DataSet distributed cache</li>
  * <li>@author: li.pan</li>
  * <li>Date: 2019/11/23 14:15</li>
  * <li>Version: V1.0</li>
  * <li>Description:
  * step 1: register a local/HDFS file
  * step 2: read the content of the distributed cache in the open() method
  * </li>
  */
object DistributedCacheApp {

  def main(args: Array[String]): Unit = {

    val env = ExecutionEnvironment.getExecutionEnvironment

    val filePath = "file:///Users/lipan/workspace/flink_demo/flink-local-train/src/main/resources/sink/scala/cache.txt"

    // step 1: register a local/HDFS file
    env.registerCachedFile(filePath, "pk-scala-dc")

    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")

    data.map(new RichMapFunction[String, String] {

      // step 2: read the content of the distributed cache in the open() method
      override def open(parameters: Configuration): Unit = {
        val dcFile = getRuntimeContext.getDistributedCache.getFile("pk-scala-dc")

        // FileUtils.readLines returns a Java collection, which is not directly
        // usable from Scala, so convert it with JavaConverters
        val lines = FileUtils.readLines(dcFile)

        import scala.collection.JavaConverters._
        for (ele <- lines.asScala) {
          println(ele)
        }
      }

      override def map(value: String): String = value
    }).print()
  }

}

Accumulators

Accumulators are very simple: results are accumulated through an add operation, and the final result can be retrieved after the job has executed.

The simplest accumulator is a counter: you increment it with the Accumulator.add(V value) method. At the end of the job, Flink merges all partial results and sends the final result to the client. Accumulators are useful when debugging, or when you want to quickly learn more about your data.

Flink has several built-in accumulators, each of which implements the Accumulator interface: IntCounter, LongCounter and DoubleCounter. Below is an example of using a counter, for instance in a distributed word-count program.

Example:

Step 1: Create an accumulator object in the user-defined transformation operator where you want to use it.

val counter = new LongCounter()

Step 2: Register the accumulator object, typically in the open() method of a rich function. You also give it a name here.

 getRuntimeContext.addAccumulator("ele-counts-scala", counter)

Step 3: Use the accumulator wherever needed in the operator.

counter.add(1)

Step 4: The overall result is stored in the JobExecutionResult object returned by execute(). (This requires waiting for the job execution to complete.)

val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")

The accumulator Java code is as follows:

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;

/**
 * <p/>
 * <li>title: Flink counter</li>
 * <li>@author: li.pan</li>
 * <li>Date: 2019/12/29 14:59</li>
 * <li>Version: V1.0</li>
 * <li>Description:
 * Java implementation: results are accumulated through an add operation, and the final
 * result can be retrieved after the job has executed.
 * </li>
 */
public class JavaCounterApp {

    public static void main(String[] args) throws Exception {

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");

        DataSet<String> info = data.map(new RichMapFunction<String, String>() {

            // step 1: define the counter
            LongCounter counter = new LongCounter();

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);

                // step 2: register the counter
                getRuntimeContext().addAccumulator("ele-counts-java", counter);
            }

            @Override
            public String map(String value) throws Exception {
                // step 3: use the counter
                counter.add(1);
                return value;
            }
        });

        String filePath = "file:///Users/lipan/workspace/flink_demo/flink-local-train/src/main/resources/sink/java/";
        info.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE).setParallelism(2);
        JobExecutionResult jobResult = env.execute("CounterApp");

        // step 4: get the counter result
        long num = jobResult.getAccumulatorResult("ele-counts-java");

        System.out.println("num = [" + num + "]");
    }
}

The accumulator Scala code is as follows:

import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.FileSystem.WriteMode

/**
  * <p/>
  * <li>Description: Flink counter</li>
  * <li>@author: panli@[email protected]</li>
  * <li>Date: 2019-04-14 21:53</li>
  * Scala implementation: results are accumulated through an add operation, and the final
  * result can be retrieved after the job has executed.
  */
object CountApp {

  def main(args: Array[String]): Unit = {

    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")

    val info = data.map(new RichMapFunction[String, String]() {
      // step 1: define the counter
      val counter = new LongCounter()

      override def open(parameters: Configuration): Unit = {
        // step 2: register the counter
        getRuntimeContext.addAccumulator("ele-counts-scala", counter)
      }

      override def map(in: String): String = {
        // step 3: use the counter
        counter.add(1)
        in
      }
    })

    val filePath = "file:///Users/lipan/workspace/flink_demo/flink-local-train/src/main/resources/sink/scala/"
    info.writeAsText(filePath, WriteMode.OVERWRITE).setParallelism(2)
    val jobResult = env.execute("CounterApp")

    // step 4: get the counter result
    val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")

    println("num: " + num)
  }

}

 
