Spark: common RDD transformation and action operations explained in detail

This article first appeared on my personal public account: TechFlow. Original writing is not easy; your follow is appreciated.


Today's post is the third article in the Spark series; we continue to look at some operations on RDDs.

We said earlier that RDD operations in Spark can be divided into two types: transformations and actions. For a transformation, Spark does not compute the result for us; it only generates a new RDD node and records the operation. Only when an action is executed does Spark actually run the whole computation from the beginning.

Transformations can be further divided into element-wise transformations and set transformations.

Element-wise transformations

Element-wise transformations are very common; the most frequently used are map and flatMap. As the names suggest, both are map operations. Map should be familiar: it came up in the earlier MapReduce article and in the article on Python's map and reduce. In short, it applies an operation to every element.

For example, suppose we have the sequence [1, 3, 4, 7] and we want to square every element. We could of course do this with a for loop, but the better way in Spark is to use map.

nums = sc.parallelize([1, 3, 4, 7])
square = nums.map(lambda x: x * x)

We know that map is a transformation, so square is still an RDD. If we print it directly we do not get the results, only information about the RDD:

The conversion diagram of the internal RDD looks like this:

If we want to see the results, we must perform an action, such as take. Let's take a look:
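Since the original screenshots are not reproduced here, a minimal sketch of what this looks like in a PySpark shell (the exact RDD repr string may differ by Spark version):

print(square)          # only RDD info, e.g. PythonRDD[1] at RDD at PythonRDD.scala:...
print(square.take(4))  # the action triggers the computation: [1, 9, 16, 49]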

This matches our expectation. The map operation should already be familiar to readers who have been following along, so what is flatMap?

The difference lies in the word flat. flatMap means that the result of the map is flattened. To put it plainly: if the result of the map is an array, the array is unpacked and its contents are pulled out and combined into a single sequence.

Let's take a look at an example:

texts = sc.parallelize(['now test', 'spark rdd'])
split = texts.map(lambda x: x.split(' '))

Since the objects we map over are strings, splitting a string gives an array of strings. If we use map, the result will be:

What if we use flatMap instead? Let's try it:

In comparison, have you noticed the difference?

Yes, the result of map is an array of arrays: each string is split into an array, so stitching these arrays together naturally gives an array of arrays. flatMap flattens these arrays and puts their contents together in one sequence, which is the biggest difference between the two.
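Since the output screenshots are not reproduced here, a hedged sketch of the two results (collect is used only to make the outputs visible):

texts = sc.parallelize(['now test', 'spark rdd'])
print(texts.map(lambda x: x.split(' ')).collect())
# [['now', 'test'], ['spark', 'rdd']]   <- an array of arrays
print(texts.flatMap(lambda x: x.split(' ')).collect())
# ['now', 'test', 'spark', 'rdd']       <- flattened into a single array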

Set transformations

The element-wise transformations are described above; now let's look at the transformations on sets.

The set operations are mainly union, distinct, intersection and subtract. You can first look at the following picture to get an intuitive feel, and then we will go through them one by one:

First look at distinct, which, as the name suggests, removes duplicates, just like DISTINCT in SQL. It takes an RDD and produces a new RDD in which every element is unique. One thing to note is that distinct is expensive, because it performs a shuffle to redistribute all the data so that only one copy of each element is kept. If you don't yet know what a shuffle is, don't worry; we will cover it in detail in a later article. For now just remember that it is expensive.
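A minimal sketch of distinct (the order of the returned elements is not guaranteed):

nums = sc.parallelize([1, 3, 3, 4, 7, 7])
print(nums.distinct().collect())  # e.g. [1, 3, 4, 7], order not guaranteed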

The second operation is union, which is also easy to understand: it merges all the elements of two RDDs. You can think of it as the extend operation on a Python list. Like extend, it does not check for duplicates, so elements that appear in both of the merged sets are not filtered out; every copy is kept.

The third operation is intersection, which, as the name says, returns the intersection, that is, the part where the two sets overlap. This should be easy to understand; look at the following picture:

The blue part in the figure, the overlap of the two sets A and B, is the result of A.intersection(B): the elements common to both sets. This operation also performs a shuffle, so its overhead is just as large, and it removes duplicate elements.

The last one is subtract, which is the set difference: the elements that belong to A but not to B. Again, we can represent it with a picture:

The gray shaded part in the figure above is the difference between the two sets A and B. This operation also performs a shuffle and is just as expensive.
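A hedged sketch of these three operations on two small RDDs (element order in the results is not guaranteed):

a = sc.parallelize([1, 2, 3, 3])
b = sc.parallelize([3, 4, 5])
print(a.union(b).collect())         # [1, 2, 3, 3, 3, 4, 5], duplicates are kept
print(a.intersection(b).collect())  # [3], shuffles and removes duplicates
print(a.subtract(b).collect())      # [1, 2], elements of a that are not in b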

Besides these, there are other set operations such as cartesian (the Cartesian product) and sample (sampling), but they are used relatively rarely, so we will not go into them here. Interested readers can look them up; they are not complicated.

Action operations

The most commonly used actions on an RDD are the ones that fetch results. After all, we compute for a long time in order to get results; obtaining an RDD is obviously not our goal. The main result-fetching actions are take, top, and collect. These three have no special tricks, so a brief introduction will do.

collect fetches everything: it returns all the elements. Both take and top need a parameter specifying the number of items. take returns the specified number of elements from the RDD in their existing order, while top returns the largest few elements. The usage of take and top is exactly the same; the only difference is whether the returned elements are the topmost ones (top returns the largest elements, in descending order).

Besides these, another very common action is count. This hardly needs explanation: it counts the data, telling us how many records there are.
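A minimal sketch of these actions on the nums RDD from before:

nums = sc.parallelize([1, 3, 4, 7])
print(nums.collect())  # [1, 3, 4, 7]
print(nums.take(2))    # [1, 3], the first two elements
print(nums.top(2))     # [7, 4], the two largest elements, descending
print(nums.count())    # 4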

reduce

Besides these relatively simple ones, two more interesting actions are worth introducing. The first is reduce. As the name suggests, it is the reduce from MapReduce, and its usage is almost identical to reduce in Python: it accepts a function that performs the merge operation. Let's look at an example:

In this example our reduce function adds two ints, and the reduce mechanism repeats this operation to merge all the data, so the final result is 1 + 3 + 4 + 7 = 15.
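Since the original screenshot is not reproduced, a minimal sketch of the call:

nums = sc.parallelize([1, 3, 4, 7])
print(nums.reduce(lambda x, y: x + y))  # 15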

fold

Besides reduce there is an action called fold. It works just like reduce; the only difference is that it lets us supply an initial value, which is applied per partition. Let's continue with the example above:

Looking at this example directly may be a bit intimidating, but a simple explanation will make it clear; it is not complicated. Notice that when we created the data with parallelize we passed an extra argument 2, which is the number of partitions. You can simply think of it as the array [1, 3, 4, 7] being split into two parts, although if we collect it directly we still get the original values.

Now we call fold with two arguments: in addition to a function we pass in an initial value of 2. The whole computation then goes like this:

The answer for the first partition is 1 + 3 + 2 = 6, the answer for the second partition is 4 + 7 + 2 = 13, and finally the two partitions are merged: 6 + 13 + 2 = 21.

In other words, the initial value is applied to the result of each partition, and applied once more to the result of merging the partitions.
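A hedged sketch of the whole thing; note that the result of fold depends on the number of partitions, so with a partition count other than 2 the answer would differ:

nums = sc.parallelize([1, 3, 4, 7], 2)   # 2 partitions
print(nums.collect())                    # still [1, 3, 4, 7]
print(nums.fold(2, lambda x, y: x + y))  # (2+1+3) + (2+4+7) + 2 = 21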

aggregate

To be honest, this action is the hardest to understand, because its signature is unusual. First, both reduce and fold require that the return value has the same type as the data in the RDD: if the data type is int, the returned result must also be an int.

But in some scenarios this does not work. For example, to compute an average we need both the running sum and the count of elements, so we need to return two values. In that case the initial value should be (0, 0), meaning both the sum and the count start from 0, and we need to pass in two functions, like this:

nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))

It's natural to feel dazed the first time you see this line of code. Don't worry, we will explain it bit by bit.

Start with the first lambda. Here x is not a single value but two values, a two-tuple, which is the accumulator we eventually return. Our expectation is that the first number returned is the sum of nums and the second is the count of numbers in nums. The y is an element of nums, and each element of nums is a single int, so y is one-dimensional. To compute the sum we naturally write x[0] + y, adding the value of y to the first dimension, while the second dimension simply increases by one, because we add one for every number we read.

That part is relatively easy to understand. The second function may take a bit more effort. Unlike the first, it is not used to process the data in nums but to process partitions. When we execute aggregate, Spark does not run single-threaded: it splits the data in nums into many partitions, the result of each partition has to be merged, and this second function is called at merge time.

Similar to the first function, x is the accumulated result and y is the (sum, count) produced by another partition that needs to be merged in. So y here is two-dimensional: its first dimension is the sum of some partition and its second dimension is the number of elements in that partition. Naturally we add both to x.

The figure above showed the calculation process with two partitions, where lambda1 is the first anonymous function we pass in and lambda2 is the second. With the diagram in mind it should be easy to understand.
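Putting it together, a minimal sketch of computing the average with aggregate (the variable names are just for illustration):

nums = sc.parallelize([1, 3, 4, 7])
sum_count = nums.aggregate((0, 0),
                           lambda x, y: (x[0] + y, x[1] + 1),        # fold an element into (sum, count)
                           lambda x, y: (x[0] + y[0], x[1] + y[1]))  # merge partition results
print(sum_count)                    # (15, 4)
print(sum_count[0] / sum_count[1])  # 3.75, the average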

Besides these there are a few more actions; for the sake of space we will not go through them all, and if any of them come up in later articles we will explain them in detail then.

One of the main reasons beginners resist learning Spark is that it feels too complicated: even the operations are split into transformations and actions. In fact all of this exists to enable lazy evaluation and optimize performance. By delaying evaluation, Spark can combine several operations and execute them together, reducing the consumption of computing resources. For a distributed computing framework, performance is a crucial metric; once you understand this, it is easy to see why Spark is designed this way.

This is true not only for Spark but also for deep learning frameworks such as TensorFlow. In essence, many seemingly counter-intuitive designs have deeper reasons, and once you understand them they are easy to anticipate: operations that return concrete results are actions, while operations that merely describe some computation are, eight or nine times out of ten, transformations.

Persistence operations

RDDs in Spark are evaluated lazily, and sometimes we want to use the same RDD multiple times. If we simply call actions on it, Spark will recompute the RDD and all of its dependencies every time, which obviously brings a lot of overhead. Naturally we would like frequently used RDDs to be cached and ready whenever we need them, instead of being recomputed on every use.

To solve this problem, Spark provides persistence operations. Persistence can simply be understood as caching. The usage is also very simple; we only need to persist the RDD:

texts = sc.parallelize(['now test', 'hello world'])
split = texts.map(lambda x: x.split(' '))
split.persist()

After persist is called, the RDD will be cached in memory or on disk, and we can pull it out whenever we need it without rerunning the whole pipeline. Spark supports several levels of persistence, which we control through the StorageLevel class. Let's take a look at the available StorageLevel values:
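The original table of levels is not reproduced here; a minimal sketch of choosing a level in PySpark (MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY are among the available options):

from pyspark import StorageLevel

split.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if it does not fit
split.count()                                # the first action computes and caches the RDD
split.unpersist()                            # remove it from the cache when no longer needed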

We can pick the appropriate cache level as needed. And of course, where there is persistence there is also un-persistence: for RDDs we no longer need to cache, we can call unpersist to remove them from the cache.

Although today's content looks like a long list of operations, some of them are rarely used. We only need a general impression; the details of a specific operation can be studied carefully when we actually need it. I hope everyone can look past the minor details and grasp the core essentials.

That's it for today's article. If you found it rewarding, please follow or share it; your small gesture means a lot to me.



Origin blog.csdn.net/TechFlow/article/details/105622005