1 Handling Data Skew

- Salt each key with a random suffix before the first reduceByKey, then strip the salt and aggregate a second time (two-stage aggregation), as in the code below.
```scala
package com.bigdataSpark.cn

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object DataLeanDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("DataLean")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("d:/words.txt", 4)

    rdd1.flatMap(_.split(" "))
      .map((_, 1))
      // Stage 1: salt each key with a random suffix (0-99) so a hot key
      // is spread across up to 100 distinct keys and many partitions.
      .map(t => {
        val word = t._1
        val r = Random.nextInt(100)
        (word + "_" + r, 1)
      })
      .reduceByKey(_ + _) // partial counts, one per salted key
      // Stage 2: strip the salt (assumes words contain no "_") and
      // merge the partial counts back under the original key.
      .map(t => {
        val word = t._1
        val count = t._2
        val w = word.split("_")(0)
        (w, count)
      })
      .reduceByKey(_ + _) // final count per original word
      .saveAsTextFile("d:/Scalaout/lean")
  }
}
```
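The salting works because a hot key (a very frequent word) is spread across up to 100 distinct salted keys, so the first reduceByKey runs in parallel instead of funneling every record for that word into a single task; the second reduceByKey then only has to merge at most 100 partial counts per original word.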
2 Integrating Spark with Hadoop HA

- Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory into spark/conf so that Spark can resolve the HA nameservice.
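With those two files in place, Spark can address HDFS through the logical nameservice instead of a single NameNode host. A minimal sketch, assuming a nameservice named `mycluster` is defined in hdfs-site.xml (the nameservice name and file path are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HaReadDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("HaReadDemo")
    val sc = new SparkContext(conf)
    // "mycluster" is the logical nameservice from hdfs-site.xml (assumed name);
    // the HDFS client's failover proxy resolves it to the active NameNode,
    // so the job keeps working after a NameNode failover.
    val rdd = sc.textFile("hdfs://mycluster/words.txt")
    println(rdd.count())
    sc.stop()
  }
}
```

Because the path names the nameservice rather than a host:port, no job code changes are needed when the active NameNode switches.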