Chapter 2: Summary of "Spark: RDD Programming"

1. Ordinary RDD

One, Creating an RDD by loading data from the file system
1. Reading a local file:

scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")

2. Reading a file from HDFS:

scala> val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")

Two, Creating an RDD from a parallelized collection (array)
Create from an array:

scala> val array = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(array)

Or, you can create from a list:

scala> val list = List(1,2,3,4,5)
scala> val rdd = sc.parallelize(list)


Three, RDD operations
1. Transformation operations:

  • filter(func): returns a new dataset consisting of the elements for which the function func returns true
  • map(func): passes each element through the function func and returns the results as a new dataset
  • flatMap(func): similar to map(), but each input element can be mapped to 0 or more output elements
  • groupByKey(): when applied to a dataset of (K, V) key-value pairs, returns a new dataset of (K, Iterable<V>) pairs
  • reduceByKey(func): when applied to a dataset of (K, V) key-value pairs, returns a new dataset of (K, V) pairs in which the values for each key are aggregated using the function func

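As a minimal sketch (assuming the word.txt file from the earlier example, whose contents are not specified here), each of these transformations can be tried in the spark-shell:

scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
scala> lines.filter(line => line.contains("Spark"))        // keep only the lines that contain "Spark"
scala> lines.map(line => line.split(" ").size)             // map each line to its number of words
scala> val words = lines.flatMap(line => line.split(" "))  // split every line into individual words
scala> val pairs = words.map(word => (word, 1))            // build (K, V) pairs of the form (word, 1)
scala> pairs.groupByKey()                                  // (word, Iterable of 1s) for each distinct word
scala> pairs.reduceByKey((a, b) => a + b)                  // (word, count) for each distinct word

The following example combines map() with the reduce() action to find the maximum number of words on a single line of word.txt: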
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
scala> lines.map(line => line.split(" ").size).reduce((a,b) => if (a>b) a else b)

Here map() turns each line into its word count, and reduce() takes two values at a time, keeps the larger one, and continues the comparison until only the maximum remains, so the result is the maximum number of words on any single line.

2. Action operations

  • count(): returns the number of elements in the dataset
  • collect(): returns all elements of the dataset as an array
  • first(): returns the first element of the dataset
  • take(n): returns the first n elements of the dataset as an array
  • reduce(func): aggregates the elements of the dataset using the function func (which takes two arguments and returns one value)
  • foreach(func): passes each element of the dataset to the function func and runs it
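A minimal sketch exercising these actions on an RDD built from the array used earlier (the expected results are shown in the comments):

scala> val rdd = sc.parallelize(Array(1,2,3,4,5))
scala> rdd.count()                  // 5
scala> rdd.collect()                // Array(1, 2, 3, 4, 5)
scala> rdd.first()                  // 1
scala> rdd.take(3)                  // Array(1, 2, 3)
scala> rdd.reduce((a, b) => a + b)  // 15
scala> rdd.foreach(println)         // prints each element (on the executor that holds it)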


Four, Persistence
The argument to persist() is the persistence (storage) level, for example:

  • persist(MEMORY_ONLY): if memory is insufficient, cached content is evicted according to the LRU policy.
  • persist(MEMORY_AND_DISK): if memory is insufficient, the partitions that do not fit are stored on disk.
  • rdd.cache() is equivalent to rdd.persist(MEMORY_ONLY).
scala> val list = List("Hadoop","Spark","Hive")
scala> val rdd = sc.parallelize(list)
scala> rdd.cache()  // marks the RDD for caching with persist(MEMORY_ONLY); nothing is cached yet because cache() is lazy
scala> println(rdd.count()) // first action: triggers a real computation from beginning to end, and only now is the RDD actually placed in the cache
3
scala> println(rdd.collect().mkString(",")) // second action: no recomputation from scratch is needed; the cached RDD is reused
Hadoop,Spark,Hive
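
Other storage levels can be requested explicitly through the StorageLevel constants. A minimal sketch (an RDD's storage level cannot be changed once it has been assigned, so a fresh RDD is used here):

scala> import org.apache.spark.storage.StorageLevel
scala> val rdd2 = sc.parallelize(List("Hadoop","Spark","Hive"))
scala> rdd2.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that do not fit in memory are spilled to disk
scala> rdd2.count()                                // the first action materializes and caches the RDD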

Finally, the unpersist() method can be used to manually remove a persisted RDD from the cache.
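For example, continuing with the cached rdd from above:

scala> rdd.unpersist()  // remove the RDD's blocks from the cache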



Five, Word frequency statistics example
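A minimal word-count sketch combining the operations covered above (the input path is assumed):

scala> val lines = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
scala> val wordCount = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCount.foreach(println)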


Six, Printing elements
In local mode, the statement rdd.foreach(println) or rdd.map(println) is generally used to print the elements of an RDD.
In cluster mode, output produced on the worker nodes is not shown on the driver, so use rdd.collect().foreach(println) to bring all elements to the driver and print them there, or rdd.take(100).foreach(println) to fetch and print only the first 100 elements (which avoids pulling a very large RDD onto the driver).
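For example, with the rdd created from the list in the persistence section:

scala> rdd.collect().foreach(println)  // bring all elements to the driver, then print them
scala> rdd.take(100).foreach(println)  // print at most the first 100 elements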



2. Key-value pair RDD

One, Creating a key-value pair RDD
① The first way: create by loading from a file

scala>  val lines = sc.textFile("file:///usr/local/spark/mycode/pairrdd/word.txt")
scala> val pairRDD = lines.flatMap(line => line.split(" ")).map(word => (word,1))
scala> pairRDD.foreach(println)
(i,1)
(love,1)
 ...

② The second way: create from a parallelized collection

scala> val list = List("Hadoop","Spark","Hive","Spark")
scala> val rdd = sc.parallelize(list)
scala> val pairRDD = rdd.map(word => (word,1))
scala> pairRDD.foreach(println)
(Hadoop,1)
(Spark,1)
(Hive,1)
(Spark,1)

Two, Common key-value pair transformation operations
Common key-value pair transformations include reduceByKey(), groupByKey(), sortByKey(), join(), cogroup(), etc.
1. reduceByKey(func)

//reduceByKey(func) merges the values of key-value pairs that share the same key. For example, merging the four pairs ("spark",1), ("spark",1), ("hadoop",1) and ("hadoop",1) gives ("spark",2) and ("hadoop",2). Applied to the pairRDD created above:
scala> pairRDD.reduceByKey((a,b)=>a+b).foreach(println)
(Spark,2)
(Hive,1)
(Hadoop,1)

2. groupByKey()

//For four key-value pairs ("spark",1), ("spark",2), ("hadoop",3) and ("hadoop",5), groupByKey() yields ("spark",(1,2)) and ("hadoop",(3,5)). Applied to the pairRDD created above:
scala> pairRDD.groupByKey()
scala> pairRDD.groupByKey().foreach(println)
(Spark,CompactBuffer(1, 1))
(Hive,CompactBuffer(1))
(Hadoop,CompactBuffer(1))

3. keys
keys returns the keys of a key-value pair RDD as a new RDD. For an RDD made up of the four key-value pairs ("spark",1), ("spark",2), ("hadoop",3) and ("hadoop",5), the result of keys is an RDD[String] whose content is {"spark","spark","hadoop","hadoop"}.

scala> pairRDD.keys.foreach(println)
Hadoop
Spark
Hive
Spark

4. values
values returns the values of a key-value pair RDD as a new RDD. For example, for an RDD made up of the four key-value pairs ("spark",1), ("spark",2), ("hadoop",3) and ("hadoop",5), the result of values is an RDD[Int] whose content is {1,2,3,5}.

scala> pairRDD.values.foreach(println)
1
1
1
1

5. sortByKey()

scala> pairRDD.sortByKey()
scala> pairRDD.sortByKey().foreach(println)
(Hadoop,1)
(Hive,1)
(Spark,1)
(Spark,1)
//sorted by key in ascending alphabetical order

6. sortBy()

scala> val d1 = sc.parallelize(Array(("c",8),("c",17),("a",42),("b",4),("d",9),("e",17),("f",29),("g",21),("b",9)))
scala> d1.reduceByKey(_+_).sortByKey(false).collect
res: Array[(String, Int)] = Array((g,21), (f,29), (e,17), (d,9), (c,25), (b,13), (a,42))
scala> d1.reduceByKey(_+_).sortBy(_._2,false).collect
res: Array[(String, Int)] = Array((a,42), (f,29), (c,25), (g,21), (e,17), (b,13), (d,9))
//sortByKey(false) sorts by key in descending order, while sortBy(_._2, false) sorts by value in descending order

7. mapValues(func)
mapValues applies a function to each value of a key-value pair RDD, leaving the keys unchanged.

//For example, for the pairRDD made up of the four key-value pairs ("spark",1), ("spark",2), ("hadoop",3) and ("hadoop",5), executing pairRDD.mapValues(x => x+1) yields a new key-value pair RDD containing ("spark",2), ("spark",3), ("hadoop",4) and ("hadoop",6). Applied to the pairRDD created above:
scala> pairRDD.mapValues(x => x+1)
scala> pairRDD.mapValues(x => x+1).foreach(println)
(Hadoop,2)
(Spark,2)
(Hive,2)
(Spark,2)

8. join
join performs an inner join: for two input datasets of types (K, V1) and (K, V2), only the keys present in both datasets are kept, and the result is a dataset of type (K, (V1, V2)).
For example, if pairRDD1 is the set of key-value pairs {("spark",1), ("spark",2), ("hadoop",3), ("hadoop",5)} and pairRDD2 is the set {("spark","fast")}, then pairRDD1.join(pairRDD2) is a new RDD containing the key-value pairs {("spark",(1,"fast")), ("spark",(2,"fast"))}.

scala> val pairRDD1 = sc.parallelize(Array(("spark",1),("spark",2),("hadoop",3),("hadoop",5)))
scala> val pairRDD2 = sc.parallelize(Array(("spark","fast")))
scala> pairRDD1.join(pairRDD2)
scala> pairRDD1.join(pairRDD2).foreach(println)
(spark,(1,fast))
(spark,(2,fast))

9. A comprehensive example
Question: given the key-value pairs ("spark",2), ("hadoop",6), ("hadoop",4) and ("spark",6), where each key is a book title and each value is the number of copies of that book sold on a given day, compute the average value for each key, i.e. the average daily sales of each book.

scala> val rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
scala> rdd.mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
res22: Array[(String, Int)] = Array((spark,4), (hadoop,5))
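
The same computation broken into steps (a sketch; the intermediate contents are noted in the comments):

scala> val rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
scala> val withCount = rdd.mapValues(x => (x, 1))   // ("spark",(2,1)), ("hadoop",(6,1)), ("hadoop",(4,1)), ("spark",(6,1))
scala> val sums = withCount.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // ("spark",(8,2)), ("hadoop",(10,2))
scala> val avg = sums.mapValues(x => x._1 / x._2)   // total sales divided by number of days
scala> avg.collect()                                // Array((spark,4), (hadoop,5))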




4. Reading and writing data in file systems

1. Reading and writing data in the local file system

scala> val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
scala> textFile.first()


To write the contents of the textFile variable back to the file system under another path, writeback:

scala> val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
scala> textFile.saveAsTextFile("file:///usr/local/spark/mycode/wordcount/writeback")
//Note: the path given when writing is a directory, not a specific file. You could also write textFile.saveAsTextFile("file:///usr/local/spark/mycode/wordcount/writeback.txt"), but what gets created is still a directory.
//The output directory will contain several files; these files hold the partitioned data. When a directory is read, all files under it (i.e. all of the data in that directory) are read.

If we want to load the written data back into an RDD:

scala> val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/writeback")
//writeback is a directory; reading a directory reads all of the files under it (i.e. all of the data in that directory)

2. Reading and writing data in the distributed file system HDFS

scala> val textFile = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")
scala> textFile.first()
//After executing the statements above, the first line of word.txt in the HDFS file system (not the local file system) is displayed.

Next, we write the content of textFile back to the HDFS file system (written to the hadoop user directory):

scala> val textFile = sc.textFile("word.txt")
scala> textFile.saveAsTextFile("writeback.txt")
//After executing the commands above, the text is written into the "writeback.txt" directory under "/user/hadoop" in the HDFS file system. That directory will contain a number of partition files holding the data. When we need to load the contents of writeback.txt back into an RDD, we only need to load this directory and all files under it will be loaded.

To load the contents of writeback.txt back into an RDD:

scala> val textFile = sc.textFile("hdfs://localhost:9000/user/hadoop/writeback.txt")
//If we pass a directory rather than a file name to textFile(), the contents of all files under that directory are read into the RDD.


Origin blog.csdn.net/weixin_45014721/article/details/109721502