Spark (Python Version) Study Notes for Beginners (3) - Summary and Examples of Spark Actions

Actions are the class of Spark operators that trigger the SparkContext to submit a job. The commonly used actions supported by Spark are introduced below.

1. reduce( func )
aggregates the elements of the dataset using the function func (which takes two input arguments and returns one value). The function func must be commutative (as I understand it, swapping the two arguments does not change the result) and associative, so that it can be computed correctly in parallel.

>>> data = sc.parallelize(range(1,101))
>>> data.reduce(lambda x, y: x+y)
5050
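
To see why these properties matter, here is a minimal sketch (an illustration only: the output of the first call assumes the two partitions are [1, 2] and [3, 4], which combine as (1-2) - (3-4) = 0, whereas a plain left-to-right fold would give -8):

>>> data = sc.parallelize([1, 2, 3, 4], 2)   # force two partitions for illustration
>>> data.reduce(lambda x, y: x - y)          # subtraction is neither commutative nor associative
0
>>> data.reduce(lambda x, y: x + y)          # addition satisfies both, so the result is always reliable
10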

2. collect()
returns all the elements of the dataset to the driver program as an array (a list in Python). This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

>>> data = sc.parallelize(range(1,101))
>>> data.filter(lambda x: x%10==0).collect()
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

3. count()
returns the number of elements in the dataset.

>>> data.count()
100

4. first()
returns the first element of the dataset; similar to take(1), except that it returns the element itself rather than a one-element list.

>>> data.first()
1

5. take( n )
returns the first n elements of the dataset as an array. Note that this action is not executed in parallel across multiple nodes; the elements are collected on the single machine running the driver program, which increases memory pressure there, so it should be used with caution.

>>> data.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

6. takeSample( withReplacement , num , [ seed ])
returns a random sample of num elements of the dataset as an array, sampled with or without replacement. The optional parameter [ seed ] allows the user to predefine the seed of the random number generator.

>>> data.takeSample(False, 20)
[60, 97, 91, 62, 48, 7, 49, 89, 40, 44, 15, 2, 33, 8, 30, 82, 87, 96, 32, 31]   
>>> data.takeSample(True, 20)
[96, 71, 20, 71, 80, 42, 70, 93, 77, 26, 14, 82, 50, 30, 30, 56, 93, 46, 70, 70]

7. takeOrdered( n , [ ordering ])
returns the first n elements of the RDD, using either their natural ordering or a key function supplied by the user.

>>> score = [('Amy',98),('Bob',87),('David',95),('Cindy',76),('Alice',84),('Alice',33)]
>>> scoreRDD = sc.parallelize(score)
>>> scoreRDD.takeOrdered(3)
[('Alice', 33), ('Alice', 84), ('Amy', 98)]  # as the two Alice entries show, when the first elements of the tuples are equal, the second elements are compared, still in ascending order
>>> scoreRDD.takeOrdered(3, key=lambda x: x[1])  # sort by score in ascending order
[('Alice', 33), ('Cindy', 76), ('Alice', 84)]
>>> scoreRDD.takeOrdered(3, key=lambda x: -x[1])  # sort by score in descending order
[('Amy', 98), ('David', 95), ('Bob', 87)]

Note that the second parameter here is an anonymous function. This anonymous function does not change the values in scoreRDD; that is, in the third example it does not turn each person's score into a negative number, but only supplies the key to sort by, which makes the sort effectively descending. If you do want to change the values in the RDD, you can do the following:

>>> scoreRDD.map(lambda x: (x[0], -x[1])).takeOrdered(3, lambda x: x[1])
[('Amy', -98), ('David', -95), ('Bob', -87)]

This example has no practical use; it is only a reminder of what the second parameter of takeOrdered does.

8. saveAsTextFile( path )
writes the elements of the dataset as a text file (or a set of text files) to the given path in the local file system, HDFS, or any other file system supported by Hadoop. Spark calls each element's toString method to convert it into a line of the text file.
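
A minimal sketch of saving the data and reading it back (the path '/tmp/numbers_txt' is just an example, and it must not already exist or Spark will raise an error):

>>> data = sc.parallelize(range(1, 101))
>>> data.saveAsTextFile('/tmp/numbers_txt')   # one part-xxxxx file is written per partition
>>> sc.textFile('/tmp/numbers_txt').count()   # the elements come back as strings, one per line
100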

9. saveAsSequenceFile( path )
writes the elements of the dataset as a Hadoop SequenceFile to the given path in the local file system, HDFS, or any other file system supported by Hadoop. The elements of the RDD must be key-value pairs that implement Hadoop's Writable interface. In Scala, they can also be key-value pairs that are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.).
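
In PySpark this works on key-value RDDs, with the Python keys and values converted to Writable types behind the scenes. A minimal sketch (the path is again just an example, and the order of the pairs read back may differ):

>>> pairs = sc.parallelize([('Amy', 98), ('Bob', 87), ('David', 95)])
>>> pairs.saveAsSequenceFile('/tmp/score_seq')   # stored as a Hadoop SequenceFile
>>> sc.sequenceFile('/tmp/score_seq').collect()  # read the key-value pairs back
[('Amy', 98), ('Bob', 87), ('David', 95)]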

10. saveAsObjectFile( path )
writes the elements of the dataset in a simple format using Java serialization; the data can be loaded back with SparkContext.objectFile(). (Available for Java and Scala.)
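
Since these are notes on the Python API, it is worth noting that the closest PySpark counterpart is saveAsPickleFile() together with SparkContext.pickleFile(), which use Python pickling rather than Java serialization. A minimal sketch (the path is illustrative):

>>> data = sc.parallelize(range(1, 101))
>>> data.saveAsPickleFile('/tmp/numbers_pickle')   # serialize the elements with Python's pickle
>>> sc.pickleFile('/tmp/numbers_pickle').count()   # load the objects back
100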

11. countByKey()
works only on RDDs of key-value pairs (K, V). It counts the elements for each key and returns a hash-table-like mapping of (K, int) pairs.

>>> score = [('Amy',98),('Bob',87),('David',95),('Cindy',76),('Alice',84),('Alice',33)]  # each student's score
>>> scoreRDD = sc.parallelize(score)
>>> scoreRDD.countByKey()
defaultdict(<class 'int'>, {'Cindy': 1, 'Alice': 2, 'Bob': 1, 'Amy': 1, 'David': 1})  
>>> result = scoreRDD.countByKey()
>>> type(result)  # check the type of the return value
<class 'collections.defaultdict'> 
>>> result['Alice']
2
>>> result['Sunny']
0
>>> testDict = {'Cindy': 1, 'Alice': 2, 'Bob': 1, 'Amy': 1, 'David': 1}
>>> testDict['Sunny']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Sunny'

Note in particular that what PySpark returns here is a collections.defaultdict. collections is a Python standard-library module of container data types. There is a difference between defaultdict and dict: defaultdict is a subclass of the built-in dict that is constructed with a factory function and uses it to supply a default value for any missing key. The last examples in the code above show this: although neither result nor testDict contains a key-value pair for the key 'Sunny', result returns the default value 0, while testDict raises a KeyError. I won't go into the differences between defaultdict and dict any further here; just note that the type returned is not a plain dict.

12. foreach( func )
calls the function func on each element of the dataset. This is usually done for its side effects, such as updating an accumulator or interacting with an external storage system. Note: modifying variables other than accumulators outside of foreach() may result in undefined behavior. See the Closures section of the Spark documentation for more details.
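
A minimal sketch of the accumulator use case mentioned above:

>>> acc = sc.accumulator(0)              # driver-side accumulator with initial value 0
>>> data = sc.parallelize(range(1, 101))
>>> data.foreach(lambda x: acc.add(x))   # side effect on the executors: add each element
>>> acc.value                            # the merged result is visible back on the driver
5050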
