Spark control operators

The concept:
There are three control operators: cache, persist, and checkpoint. All of them can persist an RDD, and the unit of persistence is the partition. cache and persist are lazily executed; an action operator is required to trigger execution. The checkpoint operator not only persists the RDD to disk, it also cuts off the dependencies (lineage) between RDDs.

  • cache
    By default, cache persists the RDD's data in memory. cache is lazily executed.
    Note: cache() = persist() = persist(StorageLevel.MEMORY_ONLY)
    test code:
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 SparkConf conf = new SparkConf();
 conf.setMaster("local").setAppName("CacheTest");
 JavaSparkContext jsc = new JavaSparkContext(conf);
 JavaRDD<String> lines = jsc.textFile("./NASA_access_log_Aug95");

 // cache is lazy; assign the return value and reuse it so later jobs read the persisted data
 lines = lines.cache();

 // First count triggers the cache: it pays initialization + cache + compute time
 long startTime = System.currentTimeMillis();
 long count = lines.count();
 long endTime = System.currentTimeMillis();
 System.out.println(count + " records, initialization + cache + compute time = "
          + (endTime - startTime));

 // Second count reads the persisted data, so only compute time is measured
 long countStartTime = System.currentTimeMillis();
 long countResult = lines.count();
 long countEndTime = System.currentTimeMillis();
 System.out.println(countResult + " records, compute time = "
           + (countEndTime - countStartTime));

 jsc.stop();
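    When this runs, the second count should report a noticeably smaller time than the first, since the partitions are read back from memory instead of being re-read and re-parsed from the log file.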

  • persist
    You can specify the persistence level. The most commonly used levels are MEMORY_ONLY and MEMORY_AND_DISK; a "_2" suffix indicates the number of replicas. A short sketch follows the notes below.
    The persistence levels are as follows:
    [Image: table of StorageLevel persistence levels]
    Notes on cache and persist:
    1. cache and persist are lazily executed; an action operator must trigger execution.
    2. The return value of cache or persist can be assigned to a variable; using that variable in other jobs means using the persisted data. The unit of persistence is the partition.
    3. An action operator cannot immediately follow cache or persist in the same chained expression.
    4. Data persisted by cache and persist is cleared once the application finishes.

Error example: rdd.cache().count() does not return the persisted RDD; the chained expression returns a number (the count).
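A minimal sketch of persist usage that ties the notes above together; it assumes the same kind of local setup as the cache example, and the RDD name and data are illustrative only:

 import java.util.Arrays;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.storage.StorageLevel;

 SparkConf conf = new SparkConf();
 conf.setMaster("local").setAppName("PersistTest");
 JavaSparkContext jsc = new JavaSparkContext(conf);
 JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

 // Note 2: assign the return value and reuse the variable so other jobs read the persisted partitions
 rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK());
 // StorageLevel.MEMORY_AND_DISK_2() would keep two replicas of each partition

 // Note 3 (wrong): chaining an action returns the count (a long), not the persisted RDD
 // long n = rdd.persist(StorageLevel.MEMORY_AND_DISK()).count();

 rdd.count();  // the first action triggers the persistence
 rdd.count();  // served from memory, falling back to disk if partitions were evicted
 jsc.stop();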

  • checkpoint
    checkpoint persists the RDD to disk and also cuts the dependencies (lineage) between RDDs. Data in the checkpoint directory is not cleared when the application finishes.
    How checkpoint is implemented:
    1. When a job finishes, Spark traces the lineage backwards from the finalRDD.
    2. When it reaches an RDD on which checkpoint() was called, it marks that RDD.
    3. The Spark framework automatically starts a new job that recomputes the marked RDD's data and persists it to HDFS.
    Optimization: before checkpointing an RDD, it is best to cache it first, so that the newly started job only needs to copy the data from memory to HDFS, saving the recomputation step (see the sketch after the usage code below).
    Usage:

 import java.util.Arrays;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 SparkConf conf = new SparkConf();
 conf.setMaster("local").setAppName("checkpoint");
 JavaSparkContext sc = new JavaSparkContext(conf);
 // the checkpoint directory must be set before checkpoint() is called
 sc.setCheckpointDir("./checkpoint");
 JavaRDD<Integer> parallelize = sc.parallelize(Arrays.asList(1, 2, 3));
 parallelize.checkpoint();  // lazy: only marks the RDD for checkpointing
 parallelize.count();       // the action runs the job; a second job then writes the checkpoint data
 sc.stop();
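A minimal sketch of the optimization described above, assuming the same SparkContext setup as the usage code; the variable name nums is illustrative. Caching the RDD before checkpointing lets the automatically started checkpoint job copy the partitions from memory to the checkpoint directory instead of recomputing them:

 JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3));
 nums = nums.cache();  // keep the computed partitions in memory
 nums.checkpoint();    // mark the RDD for checkpointing
 nums.count();         // the job computes once; the checkpoint job then copies from memory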
