sample(withReplacement : scala.Boolean, fraction : scala.Double, seed : scala.Long)
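The sample operator draws a random sample from an RDD. It takes three parameters:
withReplacement: whether an element is put back after it has been drawn. true means it is put back, so the same element may appear in the sample more than once.
fraction: how much to draw, a double. Without replacement it is a value between 0 and 1, e.g. 0.3 draws roughly 30% of the elements. With replacement it is the expected number of times each element is drawn. Either way the size of the sample is approximate, not exact.
seed: the seed for the random number generator that drives the sampling. In most cases the first two parameters are enough. So what is this one for? It is mainly used for debugging: when it is unclear whether the program or the data is at fault, setting the seed to a fixed value makes the sample the same on every run.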
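The effect of a fixed seed can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: the class `SeedSketch` and the method `bernoulliSample` are hypothetical names introduced here, and the method only mimics the idea of sampling without replacement by keeping each element independently with probability `fraction`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper class for illustration only; not part of Spark.
class SeedSketch {

    // Keeps each element independently with probability `fraction`.
    // The sample size is therefore only approximately fraction * data.size().
    static <T> List<T> bernoulliSample(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed); // a fixed seed makes the draw repeatable
        List<T> out = new ArrayList<>();
        for (T element : data) {
            if (rng.nextDouble() < fraction) {
                out.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("A", "A", "A", "B", "C", "D", "E", "F");
        List<String> first = bernoulliSample(data, 0.5, 42L);
        List<String> second = bernoulliSample(data, 0.5, 42L);
        System.out.println(first);
        // Same data, same fraction, same seed: the two samples are identical.
        System.out.println(first.equals(second));
    }
}
```

With the seed fixed, any difference between two runs has to come from the program or the input data, not from the sampling step.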
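The code is below.
The rough idea: draw a sample, run a word count on the sample, sort by count, and take the keys that occur most often. Such a key is the one causing the data skew.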
package com.lyzx.spark.streaming;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Day05 {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Day05");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        List<String> keys = getKeyBySample(jsc);
        System.out.println("Keys causing data skew: " + keys);
        jsc.stop();
    }
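
    /**
     * Samples the data with the sample operator and finds the keys that cause data skew,
     * so that the computation can then be optimized specifically for them.
     * @param jsc the Spark context used to build the RDD
     * @return the keys judged to cause data skew
     */
    public static List<String> getKeyBySample(JavaSparkContext jsc){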
List<String> data = Arrays.asList("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","C","D","E","F","G");
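        JavaRDD<String> rdd = jsc.parallelize(data, 2);

        // Sample with replacement, count each key within the sample,
        // swap to (count, key), sort by count descending and keep the top 3.
        List<Tuple2<Integer, String>> item =
                rdd.mapToPair(x -> new Tuple2<String, Integer>(x, 1))
                        .sample(true, 0.4)
                        .reduceByKey((x, y) -> x + y)
                        .map(x -> new Tuple2<Integer, String>(x._2(), x._1()))
                        .sortBy(x -> x._1(), false, 2)
                        .take(3);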

        List<String> keys = new ArrayList<>();
        System.out.println("keys=" + item);
        // Compare each entry with the next one in the descending list.
        for (int i = 0; i < item.size() - 1; i++) {
            Tuple2<Integer, String> current = item.get(i);
            Tuple2<Integer, String> next = item.get(i + 1);
            int v1 = current._1();
            int v2 = next._1();
            System.out.println(v1 + " " + v2);
            // The logic here is flawed: how to identify the key that causes data skew
            // also depends on the specific business case.
            // This is only a simple check and it is quite limited.
            // Integer division; for positive counts this is the same as v1 >= 3 * v2.
            if (v1 / v2 >= 3) {
                System.out.println("===");
                keys.add(current._2());
            }
        }
        return keys;
    }
}
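
As the comment in the code admits, comparing neighbouring counts is a limited check. One possible alternative, shown here only as a sketch, is to flag every key whose share of the sampled records reaches a threshold. The class `SkewCheckSketch`, the method `keysAboveShare` and the thresholds used are assumptions made for illustration, not part of the original code or of Spark; the right threshold depends on the business case. The input map has the same shape as the result of `countByKey()` on the sampled pair RDD.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of Spark.
class SkewCheckSketch {

    // Returns the keys whose count is at least `minShare` of all counted records.
    static List<String> keysAboveShare(Map<String, Long> counts, double minShare) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        List<String> keys = new ArrayList<>();
        if (total == 0) {
            return keys; // nothing was sampled, so nothing can be flagged
        }
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if ((double) e.getValue() / total >= minShare) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for a word count over a sample.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("A", 30L);
        counts.put("B", 3L);
        counts.put("C", 1L);
        // 0.5 is an arbitrary threshold chosen for this example.
        System.out.println(keysAboveShare(counts, 0.5));
    }
}
```

Unlike the ratio between neighbours, a share-based rule does not depend on how many entries `take` returns, and it can flag several heavy keys at once.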
Original article: https://blog.csdn.net/lyzx_in_csdn/article/details/79948799