Spark Learning Notes for Big Data Real-Time Computing (7): Handling RDD Data Skew

1 Handling Data Skew

  • Before the first reduceByKey, salt each key with a random suffix (two-stage aggregation): aggregate the salted keys first, then strip the salt and reduce again, as in the code below
package com.bigdataSpark.cn

import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Random

object DataLeanDemo {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[4]").setAppName("DataLean")
        val sc = new SparkContext(conf)

        val rdd1 = sc.textFile("d:/words.txt", 4)
        rdd1.flatMap(_.split(" "))
                .map((_, 1))
                // Stage 1: append a random salt (0-99) to every word, so a hot
                // key is spread across many distinct shuffle keys instead of
                // landing on a single reducer.
                .map(t => {
                    val word = t._1
                    val r = Random.nextInt(100)
                    (word + "_" + r, 1)
                })
                .reduceByKey(_ + _) // partial counts per salted key
                // Stage 2: strip the salt and combine the partial counts,
                // which is now a much smaller, evenly distributed aggregation.
                .map(t => {
                    val word = t._1
                    val count = t._2
                    val w = word.split("_")(0)
                    (w, count)
                })
                .reduceByKey(_ + _)
                .saveAsTextFile("d:/Scalaout/lean")

        sc.stop()
    }
}
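
A note on the salt range: Random.nextInt(100) splits each hot key into at most 100 salted variants. A larger range spreads a hot key across more reduce tasks, but it also multiplies the number of distinct keys in the first shuffle, so the range is a trade-off between skew reduction and shuffle overhead rather than a value to maximize.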

2 Integrating Spark with Hadoop HA

  • Copy core-site.xml and hdfs-site.xml into spark/conf, so Spark can resolve the HDFS HA logical nameservice; see the sketch below
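
To see why those two files matter, the sketch below reads HDFS through the HA logical nameservice instead of a concrete NameNode host. This is a minimal sketch, not from the original post: the nameservice name mycluster and the input path are assumptions, so substitute the dfs.nameservices value from your own hdfs-site.xml.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsHaDemo {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[4]").setAppName("HdfsHaDemo")
        val sc = new SparkContext(conf)

        // Because spark/conf now contains core-site.xml and hdfs-site.xml,
        // the logical nameservice below resolves to whichever NameNode is
        // currently active, so the job keeps working after a failover.
        // "mycluster" and the path are placeholders for this sketch.
        val rdd = sc.textFile("hdfs://mycluster/words.txt")
        println(rdd.count())

        sc.stop()
    }
}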
