Spark-related operations on Hadoop nodes

Spark RDD (Resilient Distributed Dataset) operations

On the master node of a multi-node Hadoop cluster.

Start the Spark shell:

spark-shell --master local[*]

1 Build an intRDD and convert it to an Array

val intRDD = sc.parallelize(List(3,1,2,5,5))

intRDD.collect()
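collect() pulls the RDD contents back to the driver as an Array; in the spark-shell the two lines above should print something along these lines (the exact res numbering and RDD id will differ):

// intRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize ...
// res0: Array[Int] = Array(3, 1, 2, 5, 5)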

 

2 Build a stringRDD and convert it to an Array

val stringRDD = sc.parallelize(List("Apple","Orange","Banana","Grape","Apple"))

stringRDD.collect()



 

 

3 map numeric operations

(1) Named function (note: enter it line by line; the "|" shown by the shell is the continuation prompt)

def addone(x:Int):Int = {
  return (x + 1)
}

intRDD.map(addone).collect()



 

 

 

(2) Anonymous function

intRDD.map(x => x + 1).collect()

(3) Anonymous function with anonymous parameter

intRDD.map(_ + 1).collect()
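All three map variants in step 3 are equivalent; for the intRDD built from List(3,1,2,5,5) the expected result is:

// res: Array[Int] = Array(4, 2, 3, 6, 6)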


 

 

4 map string operation

stringRDD.map(x => "fruit:" + x).collect()
 

 

5 filter numeric operations

intRDD.filter(x => x < 3).collect()

intRDD.filter(_ < 3).collect()
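Both filter forms keep only the elements smaller than 3; from List(3,1,2,5,5) the expected result is:

// res: Array[Int] = Array(1, 2)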

 

 

6 filter string operation

stringRDD.filter(x => x.contains("ra")).collect()
 

 

7 distinct operation

intRDD.distinct().collect()

stringRDD.distinct().collect()

 

 

8 randomSplit operation

val sRDD = intRDD.randomSplit(Array(0.4, 0.6))

sRDD.size

sRDD(0).collect()

sRDD(1).collect()
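randomSplit returns an Array of RDDs (here of size 2); the weights are only approximate proportions, so the two collect() results vary between runs. A minimal sketch for a repeatable split, using the optional seed parameter (the variable name below is illustrative):

// fixing the seed makes the split the same on every run
val sRDDFixed = intRDD.randomSplit(Array(0.4, 0.6), seed = 42)
sRDDFixed(0).collect()
sRDDFixed(1).collect()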




9 groupBy operation

val gRDD = intRDD.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect()

gRDD(0)

gRDD(1)
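gRDD is an Array[(String, Iterable[Int])] with one entry per group; for the sample data the "even" group holds 2 and the "odd" group holds 3, 1, 5, 5 (which group comes first in the array is not guaranteed), e.g.:

// (even,CompactBuffer(2))
// (odd,CompactBuffer(3, 1, 5, 5))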



 

 

10 Multiple-RDD conversion operations

val intRDD1 = sc.parallelize(List(3,1,2,5,5))

val intRDD2 = sc.parallelize(List(5,6))

val intRDD3 = sc.parallelize(List(2,7))

(1) union operation

intRDD1.union(intRDD2).union(intRDD3).collect()

(intRDD1 ++ intRDD2 ++ intRDD3).collect()

(2) intersection operation

intRDD1.intersection(intRDD2).collect()

(3) subtract difference operation

intRDD1.subtract(intRDD2).collect()

(4) cartesian Cartesian product operation

intRDD1.cartesian(intRDD2).collect()
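For these small lists the expected results are, roughly (element order may differ, since intersection and subtract repartition the data):

// union:        Array(3, 1, 2, 5, 5, 5, 6, 2, 7)
// intersection: Array(5)
// subtract:     Array(3, 1, 2)   (both 5s drop out because 5 is in intRDD2)
// cartesian:    10 pairs, e.g. (3,5), (3,6), (1,5), (1,6), ...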











11 RDD basic action operations

(1) Read operations

intRDD.first()

intRDD.take(2)

intRDD.takeOrdered(3)

intRDD.takeOrdered(3)(Ordering[Int].reverse)

(2) Statistical operations

intRDD.stats()

intRDD.min()

intRDD.max()

intRDD.stdev()

intRDD.count()

intRDD.sum()

intRDD.mean()
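Since these are actions, each line returns a value to the driver immediately; for List(3,1,2,5,5) the expected figures are:

// first: 3            take(2): Array(3, 1)
// takeOrdered(3): Array(1, 2, 3)      reversed: Array(5, 5, 3)
// stats(): count: 5, mean: 3.2, stdev: 1.6, max: 5, min: 1
// count: 5   sum: 16   mean: 3.2   min: 1   max: 5   stdev (population): 1.6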












 

 

 

 

 

 

 

12 RDD Key-Value basic conversion operations

val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))

// View the keys
kvRDD1.keys.collect()

// View the values
kvRDD1.values.collect()

// Keep the pairs whose key is less than 5
kvRDD1.filter{ case (key, value) => key < 5 }.collect()

// Keep the pairs whose value is less than 5
kvRDD1.filter{ case (key, value) => value < 5 }.collect()

// Square every value with mapValues
kvRDD1.mapValues(x => x * x).collect()
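For kvRDD1 the expected results of the operations above are (the order follows the original list):

// keys:                 Array(3, 3, 5, 1)
// values:               Array(4, 6, 6, 2)
// key < 5:              Array((3,4), (3,6), (1,2))
// value < 5:            Array((3,4), (1,2))
// mapValues(x => x*x):  Array((3,16), (3,36), (5,36), (1,4))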

// Sort by key; the ascending flag defaults to true
kvRDD1.sortByKey(true).collect()
kvRDD1.sortByKey().collect()
kvRDD1.sortByKey(false).collect()


// Add up the values that share the same key
kvRDD1.reduceByKey((x, y) => x + y).collect()
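The expected result merges the two (3, _) pairs; the order of the output pairs is not guaranteed:

// res: Array[(Int, Int)] = Array((1,2), (3,10), (5,6))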

// Shorthand form of the same reduce
kvRDD1.reduceByKey(_ + _).collect()

13 Multiple-RDD Key-Value conversion operations

val kvRDD1 = sc.parallelize(List((3,4),(3,6),(5,6),(1,2)))

val kvRDD2 = sc.parallelize(List((3,8)))




// Inner join: print the pairs whose keys match
kvRDD1.join(kvRDD2).foreach(println)

// Left outer join
kvRDD1.leftOuterJoin(kvRDD2).foreach(println)

// Right outer join
kvRDD1.rightOuterJoin(kvRDD2).foreach(println)
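With kvRDD2 containing only (3,8), the three joins above print roughly the following (foreach runs on the executors, so the line order varies):

// join:           (3,(4,8))  (3,(6,8))
// leftOuterJoin:  (3,(4,Some(8)))  (3,(6,Some(8)))  (5,(6,None))  (1,(2,None))
// rightOuterJoin: (3,(Some(4),8))  (3,(Some(6),8))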

// Key-Value pair difference operation
kvRDD1.subtract(kvRDD2).collect()

14 Key-Value action operations

kvRDD1.first()

kvRDD1.take(2)

val kvFirst = kvRDD1.first

kvFirst._1

kvFirst._2






// Count how many pairs there are for each key
kvRDD1.countByKey()

// collectAsMap: bring the pairs back to the driver as a Map
val KV = kvRDD1.collectAsMap()
KV(3)
KV(1)


// Look up all the values whose key is 3
kvRDD1.lookup(3)

kvRDD1.lookup(5)

15 Broadcast variables (read-only variables shared across the cluster)

(1) Without using a Broadcast variable

val kvFruit = sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))

val fruitMap = kvFruit.collectAsMap()

val fruitIds = sc.parallelize(List(2,4,1,3))

val fruitNames = fruitIds.map(x => fruitMap(x)).collect

(2) Using a Broadcast variable

val kvFruit = sc.parallelize(List((1,"apple"),(2,"orange"),(3,"banana"),(4,"grape")))

val fruitMap = kvFruit.collectAsMap()

val bcFruitMap = sc.broadcast(fruitMap)

val fruitIds = sc.parallelize(List(2,4,1,3))

val fruitNames = fruitIds.map(x => bcFruitMap.value(x)).collect

16 Accumulator

val intRDD = sc.parallelize(List(3,1,2,5,5))

val total = sc.accumulator(0.0)

val num = sc.accumulator(0)

intRDD.foreach(i => {
  total += i
  num += 1
})

println("total=" + total.value + ", num=" + num.value)

val avg = total.value / num.value
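Note: sc.accumulator is deprecated from Spark 2.0 onward. A minimal sketch of the same sum/count with the newer accumulator API (assuming Spark 2.x; the variable names here are illustrative):

// named accumulators also show up in the Spark web UI
val total2 = sc.doubleAccumulator("total")
val num2 = sc.longAccumulator("num")
intRDD.foreach(i => { total2.add(i); num2.add(1) })
println("total=" + total2.value + ", num=" + num2.value)
val avg2 = total2.value / num2.value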














17 RDD Persistence

(1) Build an example RDD

val intRddMemory = sc.parallelize(List(3,1,2,5,5))

intRddMemory.persist()

intRddMemory.unpersist()

(2) Set the storage level

















import org.apache.spark.storage.StorageLevel

val intRddMemoryAndDisk = sc.parallelize(List(3,1,2,5,5))

intRddMemoryAndDisk.persist(StorageLevel.MEMORY_AND_DISK)

intRddMemoryAndDisk.unpersist()

(The available storage levels are listed in the table further below.)

18 Build WordCount with Spark



cd workspace/

rm -R wordCount/

cd


mkdir -p ~/workspace/WordCount/data

cd ~/workspace/WordCount/data

gedit test.txt

Start the Spark shell:

spark-shell --master local[*]


Put the following content into test.txt:

Apple Apple Orange Banana Grape Grape

(1) Read the local file

val textFile = sc.textFile("file:/home/zwxq/workspace/WordCount/data/test.txt")

(2) Split each line into words

val stringRDD = textFile.flatMap(line => line.split(" "))





Persistent storage level options (the values that can be passed to persist(), see step 17 (2)):

MEMORY_ONLY

Store the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed when needed.

MEMORY_AND_DISK

Store the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, the excess partitions are saved on disk and read from there when needed.

MEMORY_ONLY_SER

Store the RDD as serialized objects (one byte array per partition). This is generally more space-efficient than storing deserialized objects, especially with a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER

Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are stored on disk instead of being recomputed each time they are needed.

DISK_ONLY

Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2

Same as the corresponding levels above, but each partition is replicated on two cluster nodes.
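As an illustrative sketch (the variable name is hypothetical), any of the levels above is passed to persist() exactly as in step 17 (2):

import org.apache.spark.storage.StorageLevel

val intRddSer = sc.parallelize(List(3,1,2,5,5))

intRddSer.persist(StorageLevel.MEMORY_ONLY_SER)

intRddSer.count()      // the first action materializes the cache

intRddSer.unpersist()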



Continuing the WordCount example (step 18):

(3) Build Key-Value pairs and reduce by key

val countsRDD = stringRDD.map(word => (word, 1)).reduceByKey(_ + _)

(4) Save the result

countsRDD.saveAsTextFile("file:/home/zwxq/workspace/WordCount/data/output")

exit

ll

cd output

ll

(5) View the result

cat part-00000
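For the test.txt above, the output should contain counts along these lines (the order of the lines and the number of part-xxxxx files depend on the partitioning):

(Apple,2)
(Orange,1)
(Banana,1)
(Grape,2)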










 
