Spark Python Common APIs

Author: chen_h
WeChat & QQ: 862251340
WeChat public account: coderpai


This article mainly follows the book "Learning Spark" (Spark Fast Big Data Analysis) and records some commonly used Python interfaces.


An RDD in Spark is an immutable, distributed collection of objects. Each RDD is divided into multiple partitions, and these partitions run on different nodes in the cluster. Users can create RDDs in two ways: by reading an external dataset, or by distributing a collection of objects (such as a list or set) from the driver program. Once created, RDDs support two types of operations: transformations and actions. A transformation produces a new RDD from an existing RDD, such as the filter() function. An action computes a result on the RDD and either returns it to the driver program or stores it in an external storage system (e.g. HDFS), such as the first() function. If you cannot tell whether a function is a transformation or an action, look at its return value: if it returns an RDD, the function is a transformation; if it returns any other data type, it is an action.
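
A minimal sketch of the two creation paths and the two operation types (assuming a SparkContext named sc is already available):

# Create an RDD by distributing a driver-side collection
numsRDD = sc.parallelize([1, 2, 3, 4])
# ... or by reading an external dataset (the path here is hypothetical)
# linesRDD = sc.textFile("hdfs://.../input.txt")

squaredRDD = numsRDD.map(lambda x: x * x)      # transformation: returns a new RDD
total = squaredRDD.reduce(lambda x, y: x + y)  # action: returns a plain number (30)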


APIs for transformation operations


filter() function

Function example:

errorsRDD = inputRDD.filter(lambda x: "error" in x)

# or
def hasError(line):
  return 'error' in line

errorsRDD = inputRDD.filter(hasError)

What this API does is pick out the elements that contain "error". Note that the filter() operation does not change the data in the existing inputRDD; instead, it returns a brand-new RDD.


union() function

Function example:

errorsRDD = inputRDD.filter(lambda line : "error" in line)
warningRDD = inputRDD.filter(lambda line : "warning" in line)
badLinesRDD = errorsRDD.union(warningRDD)

What this API does is compute the union of two RDDs. Unlike the mathematical union, it does not remove duplicates: if an element appears in both RDDs, the newly generated RDD will contain it twice.


intersection() function

Function example:

inputRDD = sc.parallelize([1,2,3,4,5,6,7,8,9])
a = inputRDD.filter(lambda x : x % 2 == 0) # 2,4,6,8
b = inputRDD.filter(lambda x : x > 5) # 6,7,8,9
c = a.intersection(b) # 6,8 (ordering not guaranteed)

What this API does is compute the intersection of two RDDs. The API also removes all duplicate elements at runtime (duplicates within a single RDD are removed as well). Although intersection() is conceptually similar to union(), it performs much worse, because it needs to shuffle data across the network to find the common elements.


subtract() function

Function example:

inputRDD = sc.parallelize([1,2,3,4,5,6,7,8,9])
a = inputRDD.filter(lambda x : x % 2 == 0) # 2,4,6,8
b = inputRDD.filter(lambda x : x > 5) # 6,7,8,9
c = a.subtract(b) # 2,4

The role of this API is to compute the difference of two RDDs, that is, to return an RDD consisting of all the elements that exist in the first RDD but not in the second. Like intersection(), this API also requires a data shuffle.


cartesian() function

Function example:

inputRDD = sc.parallelize([1,2,3,4,5,6,7,8,9])
a = sc.parallelize(['a','b','c','d'])
b = inputRDD.filter(lambda x : x > 5) # 6,7,8,9
c = a.cartesian(b) 
# output
[('a', 6), ('a', 7), ('a', 8), ('a', 9), ('b', 6), ('b', 7), ('b', 8), ('b', 9), ('c', 6), ('c', 7), ('c', 8), ('c', 9), ('d', 6), ('d', 7), ('d', 8), ('d', 9)]

What this API does is compute the Cartesian product of two RDDs. This transformation returns all possible (a, b) pairs, where a is an element of the source RDD and b is an element of the other RDD. The Cartesian product is useful when we want to consider all possible combinations, such as calculating the expected level of interest of various users in various products. We can also take the Cartesian product of an RDD with itself, which is useful in applications that compute similarity between users. However, note that the Cartesian product of large-scale RDDs is expensive.
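
For instance, a rough sketch of the self-Cartesian-product idea for user similarity (the user IDs here are made up for illustration):

usersRDD = sc.parallelize(['u1', 'u2', 'u3'])
pairsRDD = usersRDD.cartesian(usersRDD).filter(lambda p: p[0] < p[1])
# [('u1', 'u2'), ('u1', 'u3'), ('u2', 'u3')] -- each unordered pair appears once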


map() function

Function example:

doubleRDD = inputRDD.map(lambda x: x * 2)

The function of this API is to apply the given function to every element of inputRDD and return a new RDD of the results. Here each element is multiplied by 2; the new RDD has the same number of elements as the original.


flatMap() function

Function example:

inputRDD = sc.parallelize(['i love you', 'hello world'])
outputRDD = inputRDD.flatMap(lambda x: x.split(' '))
print(outputRDD.count())  # 5

The role of this API is to apply a function to each element of the input RDD, where the function returns an iterator (a sequence of values) rather than a single element. The output RDD is not composed of iterators; instead, it contains all the elements yielded by each iterator.
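
The contrast with map() makes the flattening behavior clear; with the same input as above:

inputRDD.map(lambda x: x.split(' ')).collect()
# [['i', 'love', 'you'], ['hello', 'world']] -- 2 elements, each a list
inputRDD.flatMap(lambda x: x.split(' ')).collect()
# ['i', 'love', 'you', 'hello', 'world'] -- 5 flattened elements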


distinct() function

Function example:

inputRDD = sc.parallelize([2,4,3,1,2,3,3,2,1,3,4,2,3,1,4])
distinctRDD = inputRDD.distinct()
distinctRDD.collect()   # 1,2,3,4

What this API does is generate a new RDD containing only the distinct elements. However, since this operation requires shuffling all the data over the network, it is very time-consuming.


sample() function

Function example:

inputRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,0])
sampleRDD = inputRDD.sample(False, 0.5)
# 2,3,4,7
sampleRDD = inputRDD.sample(True, 0.5)
# 2,2,2,5,6,6

The function of this API is to randomly sample the data in the RDD. The first parameter (withReplacement) indicates whether elements can be sampled more than once; if True, the same element may appear multiple times in the result. The second parameter (fraction) is the expected probability that each element is sampled, and it must lie in the range [0, 1].
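
The full signature is sample(withReplacement, fraction, seed=None), so passing a seed makes the sampling reproducible:

sampleRDD = inputRDD.sample(False, 0.5, seed=42)  # same subset on every run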


reduceByKey() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.reduceByKey(lambda x,y: x+y)
# output
[(1, 2), (3, 10)]

What this API does is merge the values that have the same key, using the supplied function.
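
A classic use of reduceByKey() is word count; a minimal sketch, assuming linesRDD is an RDD of text lines like the one in the flatMap() example:

wordsRDD = linesRDD.flatMap(lambda line: line.split(' '))
countsRDD = wordsRDD.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)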


groupByKey() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.groupByKey()
# {(1, [2]), (3, [4, 6])}
for (key, values) in outputRDD.collect():
  for item in values:
    print(item)
# output
2
4
6

What this API does is group together the values that have the same key. Note that the grouped values come back as pyspark.resultiterable.ResultIterable objects, which can be iterated like a list.
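
Grouping and then aggregating per key can usually be expressed more efficiently with reduceByKey(), which combines values inside each partition before shuffling; a small comparison sketch:

inputRDD.groupByKey().mapValues(sum).collect()      # [(1, 2), (3, 10)]
inputRDD.reduceByKey(lambda x, y: x + y).collect()  # same result, less data shuffled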


mapValues() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.mapValues(lambda x: x+1)
# output
[(1, 3), (3, 5), (3, 7)]

What this API does is apply a function to each value in a pair RDD without changing the key.


flatMapValues() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.flatMapValues(lambda x: range(x, 6))
# output
[(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)]

What this API does is apply a function that returns an iterator to each value in a pair RDD, and then, for each element the iterator yields, generate a key-value pair with the original key. It is often used for tokenization.
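
A small tokenization-style sketch (the sentences here are made up for illustration):

pairsRDD = sc.parallelize([(1, 'i love you'), (2, 'hello world')])
tokensRDD = pairsRDD.flatMapValues(lambda s: s.split(' '))
# [(1, 'i'), (1, 'love'), (1, 'you'), (2, 'hello'), (2, 'world')]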


keys() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.keys()
# output
[1,3,3]

What this API does is return an RDD containing only the keys.


values() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.values()
# output
[2,4,6]

What this API does is return an RDD containing only values.


sortByKey() function

Function example:

inputRDD = sc.parallelize([(11,2),(13,4),(3,6)])
outputRDD = inputRDD.sortByKey()
# output
[(3,6),(11,2),(13,4)]

What this API does is return an RDD sorted by key.
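
sortByKey() also accepts an ascending flag, so a descending sort is just:

outputRDD = inputRDD.sortByKey(ascending=False)
# [(13, 4), (11, 2), (3, 6)]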


combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
outputRDD = inputRDD.combineByKey(
  (lambda x: (x, 1)),
  (lambda x, y: (x[0] + y, x[1] + 1)),
  (lambda x, y: (x[0] + y[0], x[1] + y[1]))
)
# output
[(1, (2, 1)), (3, (10, 2))]

To understand combineByKey(), you must first understand how it handles each element. As combineByKey() iterates over the elements in a partition, each element's key either has not been seen before or matches the key of a previous element.
If the key is new, combineByKey() uses the function called createCombiner() to create the initial value of the accumulator for that key. Note that this happens the first time the key appears in each partition, not the first time it appears in the entire RDD.
If the key has already been seen while processing the current partition, combineByKey() uses mergeValue() to merge the key's current accumulator with the new value.
Since each partition is processed independently, the same key can have multiple accumulators. If two or more partitions have accumulators for the same key, the user-provided mergeCombiners() method combines the per-partition results.
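
The example above builds a (sum, count) accumulator for each key, so per-key averages follow with one more mapValues(); a minimal follow-up sketch:

avgRDD = outputRDD.mapValues(lambda acc: acc[0] / float(acc[1]))
# [(1, 2.0), (3, 5.0)]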


APIs for action operations


top() function

Function example:

outputdata = inputRDD.top(10)
for line in outputdata:
  print(line)

The function of this API is to return the largest K elements of inputRDD in descending order; the returned data type is a list of length K.


take() function

Function example:

outputdata = inputRDD.take(10)
for line in outputdata:
  print(line)

The function of this API is to return the first K elements of inputRDD; the returned data type is a list of length K.


first() function

Function example:

inputRDD.first()

The function of this API is to return the first element of inputRDD. For an RDD read from a text file, the returned element is a Unicode string.


collect() function

Function example:

outputdata = inputRDD.collect()
for line in outputdata:
  print(line)

The function of this API is to return all the elements in inputRDD as a list. Note that collect() should only be used on small datasets; if the dataset is too large, it will consume a lot of time and memory.


count() function

Function example:

num = inputRDD.count()  # avoid naming the result len, which shadows the builtin
print(num)

The purpose of this API is to return the number of elements in the inputRDD.


reduce() function

Function example:

inputRDD = sc.parallelize([1,2,3,4,5,6,7,8,9])
output = inputRDD.reduce(lambda x,y : x+y)
# output
45

The role of this API is to take a function as a parameter; the function operates on two elements of the RDD's element type and returns a new element of the same type.


takeSample(withReplacement, num, seed=None) function

Function example:

inputRDD = sc.parallelize(range(10))
output = inputRDD.takeSample(True, 20)
# output
[8, 5, 5, 7, 7, 6, 3, 1, 0, 7, 5, 5, 4, 3, 3, 4, 8, 2, 7, 4]
output = inputRDD.takeSample(False, 5)
# output
[2, 9, 7, 8, 0]

What this API does is return a list of the specified length sampled from the RDD. If withReplacement is True, the same element may be sampled more than once. Unlike sample(), takeSample() is an action and returns a list rather than an RDD.


countByValue() function

Function example:

inputRDD = sc.parallelize(['a','b','c','d'])
output = inputRDD.countByValue()
# output
{'a': 1, 'c': 1, 'b': 1, 'd': 1}

The role of this API is to count the number of times each element appears in the RDD.


subtractByKey() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
otherRDD = sc.parallelize([(3,9)])
output = inputRDD.subtractByKey(otherRDD).collect()
# output
[(1,2)]

What this API does is remove the elements of inputRDD whose keys also appear in otherRDD. Note that subtractByKey(), like the join operations below, is actually a transformation that returns an RDD; the outputs shown here are after collect().


join() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
otherRDD = sc.parallelize([(3,9)])
output = inputRDD.join(otherRDD).collect()
# output
[(3, (4, 9)), (3, (6, 9))]

The purpose of this API is to perform an inner join on two RDDs.


cogroup() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
otherRDD = sc.parallelize([(3,9)])
output = inputRDD.cogroup(otherRDD)
# conceptual output (the grouped values are actually iterables; see below)
{(1, ([2], [])), (3, ([4, 6], [9]))}

The purpose of this API is to group together data in two RDDs with the same key.
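
When collected, the grouped values are pyspark.resultiterable.ResultIterable objects; converting them to plain lists makes the result readable:

readable = inputRDD.cogroup(otherRDD).mapValues(lambda v: (list(v[0]), list(v[1])))
readable.collect()
# [(1, ([2], [])), (3, ([4, 6], [9]))]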


rightOuterJoin() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
otherRDD = sc.parallelize([(3,9)])
output = inputRDD.rightOuterJoin(otherRDD).collect()
# output
[(3, (4, 9)), (3, (6, 9))]

The role of this API is to perform a join on two RDDs in which every key of the second RDD (otherRDD) is guaranteed to appear in the result (right outer join); missing values from the first RDD are filled with None.


leftOuterJoin() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
otherRDD = sc.parallelize([(3,9)])
output = inputRDD.leftOuterJoin(otherRDD).collect()
# output
[(1, (2, None)), (3, (4, 9)), (3, (6, 9))]

The role of this API is to perform a join on two RDDs in which every key of the first RDD is guaranteed to appear in the result (left outer join); missing values from the second RDD are filled with None, as with key 1 above.


countByKey() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
output = inputRDD.countByKey()
# output
{1: 1, 3: 2}

The role of this API is to count the number of elements for each key.


collectAsMap() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
output = inputRDD.collectAsMap()
# output
{1: 2, 3: 6}

The role of this API is to return the result as a map (dictionary) for easy lookup. Note that if a key has multiple values in the RDD, later values overwrite earlier ones, so each key in the result maps to exactly one value.


lookup() function

Function example:

inputRDD = sc.parallelize([(1,2),(3,4),(3,6)])
output = inputRDD.lookup(3)
# output
[4, 6]

What this API does is return all the values corresponding to a given key.


Read data from Hive


from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("select name, age from users")
firstRow = rows.first()
print(firstRow.name)

There are three ways to pass functions to Spark in Python


# Method 1
word = rdd.filter(lambda s: "error" in s)

# Method 2
def containsError(s):
  return "error" in s
word = rdd.filter(containsError)

# Method 3
class WordFunctions(object):
  def __init__(self, query):
    self.query = query
  def getMatchesNoReference(self, rdd):
    # Safe approach: extract only the field we need into a local variable
    query = self.query
    return rdd.filter(lambda x: query in x)
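
For contrast, a sketch of the pattern that Method 3 avoids (the class and method names here are illustrative): referencing self.query directly inside the lambda forces Spark to serialize the entire object and ship it to the worker nodes.

class WordFunctionsUnsafe(object):
  def __init__(self, query):
    self.query = query
  def getMatchesMemberReference(self, rdd):
    # Unsafe: the lambda captures self, so the whole object gets serialized
    return rdd.filter(lambda x: self.query in x)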
