RDDs are immutable, fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
2. Characteristics of RDDs
Immutable
Fault Tolerant
Parallel Data Structures
In-Memory Computing
Data Partitioning and Placement
Rich Set of Operations
The short sketch below illustrates the in-memory persistence and explicit partitioning points.
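A minimal sketch, assuming a SparkContext named sc like the ones created in the examples later in this post; the dataset and partition count are purely illustrative:

import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc`
val nums = sc.parallelize(1 to 100, 4)      // explicit partitioning: 4 partitions
val squares = nums.map(n => n * n)          // transformation only; nothing runs yet
squares.persist(StorageLevel.MEMORY_ONLY)   // ask Spark to keep this RDD in memory
println(squares.count())                    // first action computes and caches the data
println(squares.count())                    // second action is served from the in-memory copy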
3. RDD Operations
What is the difference between a transformation and an action?
In short, RDDs are immutable, RDD transformations are lazily evaluated, and RDD actions are eagerly evaluated and trigger the computation of your data processing logic.
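A minimal sketch of the lazy/eager distinction, again assuming an existing SparkContext sc; the println inside map is illustrative and only fires once an action runs:

val words = sc.parallelize(Seq("a", "b", "c"))
// transformation: only records the lineage, no computation happens here
val upper = words.map { w =>
  println(s"processing $w")  // not printed until an action runs
  w.toUpperCase
}
// action: eagerly evaluated, triggers the map above
val result = upper.collect()
println(result.toList)       // List(A, B, C)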
4. Ways to Create an RDD
In general, there are three:
From an object collection created in the program
From an external file, for example one stored in HDFS
From an existing RDD, via a transformation
Example 1. Creating an RDD from an object collection
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  conf.setMaster("local[2]")
  conf.setAppName("create RDD by Array Object")
  conf.set("spark.testing.memory", "2147480000")
  val str_list = Array[String]("How to create RDD", "funny game", "Spark is cool!")
  val sc = new SparkContext(conf)
  val RDD_str = sc.parallelize(str_list, 2)  // distribute the collection across 2 partitions
  print(RDD_str.count())
}
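The array holds three elements, so count() returns 3 and the program prints:
3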
Example 2. Creating an RDD from HDFS
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  // initialize the configuration; swap in the cluster master URL when not running locally
  val sparkConf = new SparkConf().setAppName("WordCount sample")  // .setMaster("192.168.1.10:4040")
    .setMaster("local[2]").set("spark.testing.memory", "2147480000")
  val sc = new SparkContext(sparkConf)
  val rdd = sc.textFile("/user/hadoop/worddir/word.txt")  // one RDD element per line
  // split each line into words and pair every word with a count of 1
  val tupleRDD = rdd.flatMap(line => line.split(" ").toList.map(word => (word.trim, 1)))
  val resultRDD: RDD[(String, Int)] = tupleRDD.reduceByKey((a, b) => a + b)  // sum the counts per word
  resultRDD.foreach(elm => println(elm._1 + "=" + elm._2))
  Thread.sleep(10000)  // keep the application alive briefly so the output can be inspected
  sc.stop()
}
Example 3. Creating an RDD from an existing RDD
def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  conf.setAppName("Get RDD from existed RDD")
  conf.setMaster("local[2]")
  conf.set("spark.testing.memory", "2147480000")
  val str_array = Array("one", "two", "three")
  val sc = new SparkContext(conf)
  val RDD_list = sc.parallelize(str_array, 2)
  // the transformation returns a new RDD; the source RDD itself is never modified
  val RDD_list1 = RDD_list.map(l => l + " ******")
  RDD_list1.foreach(elm => println(elm))
}
5. RDD Transformations
Some commonly used RDD transformations are listed below:
Name | Description
map(func) | Applies func to each row while iterating through the dataset; the returned RDD contains whatever func returns for each row.
flatMap(func) | Similar to map(func), but func should return a collection rather than a single element, and the returned collection is flattened out. This allows an input item to map to zero or more output items.
filter(func) | Only the rows for which func returns true are collected into the returned RDD; in other words, it keeps only the rows that meet the condition defined in func.
mapPartitions(func) | Similar to map(func), but applied at the partition (chunk) level; func takes an iterator as input and iterates through the rows of one partition.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but the partition index number is additionally provided to func.
union(otherRDD) | Combines the rows of the source RDD and otherRDD into one RDD; duplicates are kept.
intersection(otherRDD) | Only the rows that exist in both the source RDD and otherRDD are returned.
subtract(otherRDD) | Subtracts the rows in otherRDD from the source RDD.
distinct([numTasks]) | Removes duplicate rows from the source RDD.
sample(withReplacement, fraction, seed) | Usually used to reduce a large dataset to a smaller one by randomly selecting a fraction of rows using the given seed, with or without replacement.
Example 1. map(func): transform each row into exactly one new element

object RDDTest08 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[2]")
    conf.setAppName("RDD map transformation")
    conf.set("spark.testing.memory", "2147480000")
    // define the Person class
    case class Person(id: Int, name: String, phoneNum: String)
    val personArray = Array("1#Jason#1233242432", "2#Mike#1232131312")
    val sc = new SparkContext(conf)
    val personRDD = sc.parallelize(personArray)
    // parse each "#"-delimited string into a Person object
    val personObjectRDD = personRDD.map(person => {
      val personInfo = person.split("#")
      Person(personInfo(0).toInt, personInfo(1), personInfo(2))
    })
    personObjectRDD.collect.foreach(println)
  }
}
With map, splitting each line produces one array per input line, so the collected result is a nested Array[Array[String]]:

val strArray = Array("this is one", "this is two")
val sc = new SparkContext(conf)
val strRDD = sc.parallelize(strArray)
val resultRDD = strRDD.map(line => line.split(" "))
val array: Array[Array[String]] = resultRDD.collect()
array.foreach(print)
Example 2. flatMap(func): the collected result is a flat Array[String]

val strArray = Array("this is one", "this is two")
val sc = new SparkContext(conf)
val strRDD = sc.parallelize(strArray)
val resultRDD = strRDD.flatMap(line => line.split(" "))
val array: Array[String] = resultRDD.collect()
array.foreach(print)
Example 3. filter(func): row-level filtering; the result is still an RDD

val strArray = Array("this is one", "this is two")
val sc = new SparkContext(conf)
val strRDD = sc.parallelize(strArray)
val resultRDD = strRDD.filter(line => line.contains("two"))
resultRDD.foreach(print)
Example 4. mapPartitions(func): apply the transformation partition by partition

case class Person(id: Int, name: String, phoneNum: String)
val personArray = Array("1#Jason#1233242432", "2#Mike#1232131312", "3#James#01902992888", "4#Tom#1231232222")
val sc = new SparkContext(conf)
val personRDD = sc.parallelize(personArray, 2)
// func receives an iterator over all rows of one partition
val personObjectRDD = personRDD.mapPartitions((iter: Iterator[String]) => {
  iter.map(person => {
    val personInfo = person.split("#")
    Person(personInfo(0).toInt, personInfo(1), personInfo(2))
  })
})
personObjectRDD.collect.foreach(println)
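mapPartitions pays off when each row needs some expensive shared setup, for example opening a database connection or building a parser: the setup can be done once per partition instead of once per row, as it would be with map.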
Example 5. mapPartitionsWithIndex: partition-level transformation, with the partition index

case class Person(id: Int, name: String, phoneNum: String, key: Int)
val personArray = Array("1#Jason#1233242432", "2#Mike#1232131312", "3#James#01902992888", "4#Tom#1231232222")
val sc = new SparkContext(conf)
val personRDD = sc.parallelize(personArray, 2)
val personObjectRDD = personRDD.mapPartitionsWithIndex((idx: Int, iter: Iterator[String]) => {
  iter.map(person => {
    val personInfo = person.split("#")
    Person(personInfo(0).toInt, personInfo(1), personInfo(2), idx)  // idx records which partition the row came from
  })
})
personObjectRDD.collect.foreach(println)
Example 6. union(otherRDD): union operation

val intArray1 = Array(0, 1, 3, 5, 7, 9)
val intArray2 = Array(0, 2, 4, 6, 8, 10)
val sc = new SparkContext(conf)
val intRDD1 = sc.parallelize(intArray1)
val intRDD2 = sc.parallelize(intArray2)
val unionRDD = intRDD1.union(intRDD2)
println(unionRDD.collect().toList)

Output
List(0, 1, 3, 5, 7, 9, 0, 2, 4, 6, 8, 10)
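Note that union keeps duplicates: 0 appears twice in the result above. Chain distinct() after union if duplicates need to be removed.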
Example 7. intersection(otherRDD): intersection operation

val sc = new SparkContext(conf)
val strRDD1 = sc.parallelize(Array("one", "two", "three"))
val strRDD2 = sc.parallelize(Array("two", "three"))
val intersectionRDD = strRDD1.intersection(strRDD2)
println(intersectionRDD.collect().toList)

Output
List(two, three)
Example 8. subtract(otherRDD): set-difference operation

val sc = new SparkContext(conf)
val strRDD1 = sc.parallelize(Array("one", "two", "three"))
val strRDD2 = sc.parallelize(Array("two", "three"))
val subtractRDD = strRDD1.subtract(strRDD2)
println(subtractRDD.collect().toList)

Output
List(one)
Example 9. distinct(): remove duplicate values

val sc = new SparkContext(conf)
val duplicatedRDD = sc.parallelize(List("one", 1, "two", 2, "three", 3, "four", "four"))
print(duplicatedRDD.distinct().collect().toList)
Example 10. sample(withReplacement, fraction, seed): random sampling

val sc = new SparkContext(conf)
val intRDD = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
print(intRDD.sample(false, 0.2).collect().toList)

Output
List(1, 4)
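Since no seed is fixed here, the sampled elements vary from run to run, and fraction is only the expected proportion of rows, so the result is not guaranteed to contain exactly 2 elements.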
6. RDD Actions
Example 1. collect(): bring the RDD back to the driver as a local collection
*** Note: if the result set is too large, collect() can cause an out-of-memory error on the driver.
*** It is best to clean and filter the result set before collecting it; see the sketch after the output below.

val sc = new SparkContext(conf)
print(sc.parallelize(Array(1, 2, 3, 4, 5)).collect().toList)

Output
List(1, 2, 3, 4, 5)
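A minimal sketch of that advice, assuming a hypothetical large RDD of Int values named bigRDD:

// reduce the data on the executors before bringing it to the driver
val filtered = bigRDD.filter(x => x > 100)    // hypothetical cleaning condition
val smallList = filtered.collect().toList     // now much safer to collect
// alternatively, cap the number of rows returned to the driver
val preview = bigRDD.take(10)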
Example 2. count(): count the elements

val sc = new SparkContext(conf)
print(sc.parallelize(Array(1, 2, 3, 4, 5)).count())

Output
5
Example 3. first(): take the first element

val sc = new SparkContext(conf)
print(sc.parallelize(Array(1, 2, 3, 4, 5)).first())

Output
1
Example 4. take(n): take the first n elements

val sc = new SparkContext(conf)
sc.parallelize(Array(1, 2, 3, 4, 5)).take(3).foreach(print)

Output
123
Example 5. reduce(func): aggregate the data with a combining function

val sc = new SparkContext(conf)
// reduce returns a plain Int here, not an RDD
val sum = sc.parallelize(Array(1, 2, 3, 4, 5)).reduce((a1: Int, a2: Int) => a1 + a2)
print(sum)

Output
15
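The function given to reduce should be commutative and associative, because Spark applies it in parallel within and across partitions, so the order of combination is not fixed.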
Example 6. takeSample(withReplacement, n, [seed])
Similar to the sample transformation, but returns exactly n elements to the driver.

val sc = new SparkContext(conf)
print(sc.parallelize(Array(1, 2, 3, 4, 5)).takeSample(false, 2).toList)

Output
List(1,5)
Example 7. takeOrdered(n, [ordering])
Take the first n elements in sorted order.

val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(sc.parallelize(Array(1, 2, 3, 4, 5)).takeOrdered(3)(Ordering[Int].reverse).toList)
println(sc.parallelize(Array(1, 2, 3, 4, 5)).takeOrdered(3).toList)

Output
List(5, 4, 3)
List(1, 2, 3)
Example 8. top(n, [ordering])
Return the top n elements.

val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(sc.parallelize(Array(1, 2, 3, 4, 5)).top(3).toList)
println(sc.parallelize(Array(1, 2, 3, 4, 5)).top(3)(Ordering[Int]).toList)
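Both calls use the natural Int ordering, under which top returns the largest elements in descending order, so the two lines print the same list.

Output
List(5, 4, 3)
List(5, 4, 3)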