[SparkAPI Java Edition] JavaPairRDD: countByKey and countByKeyApprox (Part 12)

Copyright notice: This is an original article by the author and may not be reproduced without permission. https://blog.csdn.net/sdut406/article/details/88693365

The countByKey method of JavaPairRDD

Official documentation
/**
   * Count the number of elements for each key, collecting the results to a local Map.
   *
   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
Explanation

Counts the number of elements for each key and collects the results into a local Map.
Note:
This method should only be used when the resulting map is expected to be small, because the whole map is loaded into the driver's memory.
To handle very large results, consider rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
returns an RDD[T, Long] instead of a map.
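For intuition, countByKey is equivalent to grouping the pairs by key and counting each group. A minimal plain-Java sketch of that semantics (no Spark required; CountByKeySketch and its countByKey helper are illustrative names, not part of the Spark API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CountByKeySketch {
    // Locally mimics JavaPairRDD.countByKey: count the occurrences of each key.
    static Map<String, Long> countByKey(List<SimpleEntry<String, String>> pairs) {
        return pairs.stream()
                .collect(Collectors.groupingBy(SimpleEntry::getKey, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<SimpleEntry<String, String>> pairs = Arrays.asList(
                new SimpleEntry<String, String>("cat", "11"), new SimpleEntry<String, String>("dog", "22"),
                new SimpleEntry<String, String>("cat", "33"), new SimpleEntry<String, String>("pig", "44"),
                new SimpleEntry<String, String>("duck", "55"), new SimpleEntry<String, String>("cat", "66"));
        // Counts: cat=3, dog=1, pig=1, duck=1 (map iteration order may vary).
        System.out.println(countByKey(pairs));
    }
}
```

Unlike Spark, this runs in a single JVM, which is exactly why the real countByKey should only be used when the result fits in driver memory.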

Function signature
// java
public java.util.Map<K,Long> countByKey()
// scala
def countByKey(): Map[K, Long]
Example
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.google.common.collect.Lists;

import scala.Tuple2;

public class CountByKey {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "33"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "66")), 3);

        Map<String,Long> key = javaPairRDD1.countByKey();
        for (Map.Entry<String,Long> entry : key.entrySet()){
            System.out.println(entry.getKey()+":"+entry.getValue());
        }
        sc.stop();
    }
}
Result
19/03/20 16:36:11 INFO DAGScheduler: ResultStage 1 (countByKey at CountByKey.java:23) finished in 0.093 s
19/03/20 16:36:11 INFO DAGScheduler: Job 0 finished: countByKey at CountByKey.java:23, took 1.229949 s
duck:1
cat:3
dog:1
pig:1
19/03/20 16:36:11 INFO SparkContext: Invoking stop() from shutdown hook

The countByKeyApprox method of JavaPairRDD

Official documentation
/**
   * Approximate version of countByKey that can return a partial result if it does
   * not finish within a timeout.
   *
   * The confidence is the probability that the error bounds of the result will
   * contain the true value. That is, if countApprox were called repeatedly
   * with confidence 0.9, we would expect 90% of the results to contain the
   * true count. The confidence must be in the range [0,1] or an exception will
   * be thrown.
   *
   * @param timeout maximum time to wait for the job, in milliseconds
   * @param confidence the desired statistical confidence in the result
   * @return a potentially incomplete result, with error bounds
   */
Explanation

An approximate version of countByKey that can return a partial result if it does not finish within the timeout. The confidence is the probability that the error bounds of the result contain the true value; it must lie in the range [0, 1], otherwise an exception is thrown.

@param timeout maximum time to wait for the job, in milliseconds
@param confidence the desired statistical confidence in the result
@return a potentially incomplete result, with error bounds

Function signature
// java
public PartialResult<java.util.Map<K,BoundedDouble>> countByKeyApprox(long timeout)
public PartialResult<java.util.Map<K,BoundedDouble>> countByKeyApprox(long timeout,
                                                                      double confidence)
// scala
def countByKeyApprox(timeout: Long): PartialResult[Map[K, BoundedDouble]]
def countByKeyApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]
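Example

A hedged sketch of calling countByKeyApprox, mirroring the countByKey example above (same local setup; the class name CountByKeyApprox and the sample data are illustrative). PartialResult.getFinalValue() blocks until the exact result is available, while initialValue() returns whatever (possibly partial) result exists at that moment; each BoundedDouble exposes mean(), low(), and high(). On a tiny local dataset the job typically finishes well within the timeout, so the bounds collapse to the exact counts.

```java
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.partial.BoundedDouble;
import org.apache.spark.partial.PartialResult;

import com.google.common.collect.Lists;

import scala.Tuple2;

public class CountByKeyApprox {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaPairRDD<String, String> pairs = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "33"), new Tuple2<String, String>("pig", "44")), 2);

        // Wait at most 1000 ms; ask for 90% confidence in the returned bounds.
        PartialResult<Map<String, BoundedDouble>> approx = pairs.countByKeyApprox(1000, 0.9);

        // getFinalValue() blocks until the exact result is ready.
        for (Map.Entry<String, BoundedDouble> entry : approx.getFinalValue().entrySet()) {
            BoundedDouble b = entry.getValue();
            System.out.println(entry.getKey() + ": mean=" + b.mean()
                    + " [" + b.low() + ", " + b.high() + "]");
        }
        sc.stop();
    }
}
```

Note that this requires the Spark runtime on the classpath, so it is a sketch of the call pattern rather than a standalone program.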
