Key-value pair operations

1 Pair RDD

An RDD of key-value pairs is a common data type required by many operations in Spark. Spark provides a number of operations specific to RDDs containing key-value pairs, and these RDDs are called pair RDDs. For example, pair RDDs provide the reduceByKey() method, which reduces the data for each key separately.
There are many ways to create a pair RDD in Spark. For example, many data formats that store key-value pairs directly return a pair RDD of that key-value data when loaded. In addition, an ordinary RDD can be converted into a pair RDD by calling map() with a function that returns key-value pairs.

2 Create Pair RDDs

In Python, for the key-extracted data to be usable in later functions, map() needs to return an RDD made up of two-element tuples. The following creates a pair RDD that uses the first word of each line as the key:

pairs = lines.map(lambda x: (x.split()[0], x))
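Since a pair RDD's elements are just Python tuples, the effect of the map() above can be sketched in plain Python (the sample lines are made up for illustration; this only mimics the semantics, it is not Spark code):

```python
# Plain-Python sketch of the map() above: each line becomes a
# (first_word, whole_line) tuple, mirroring the pair RDD's elements.
lines = ["hello world", "hi spark", "hello again"]  # stand-in for the lines RDD
pairs = [(x.split()[0], x) for x in lines]
print(pairs)
# [('hello', 'hello world'), ('hi', 'hi spark'), ('hello', 'hello again')]
```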

3 Conversion operation of Pair RDD

Common transformation operations on a single pair RDD (example: rdd = {(1, 2), (3, 4), (3, 6)}) are summarized as follows:

reduceByKey(func) — combine values with the same key; rdd.reduceByKey(lambda x, y: x + y) gives {(1, 2), (3, 10)}
groupByKey() — group values with the same key: {(1, [2]), (3, [4, 6])}
mapValues(func) — apply a function to each value without changing the key; rdd.mapValues(lambda x: x + 1) gives {(1, 3), (3, 5), (3, 7)}
flatMapValues(func) — apply a function that returns an iterator to each value, emitting one pair per generated element
keys() — return an RDD of just the keys: {1, 3, 3}
values() — return an RDD of just the values: {2, 4, 6}
sortByKey() — return an RDD sorted by key
Transformation operations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}):

subtractByKey(other) — remove elements whose key appears in the other RDD: {(1, 2)}
join(other) — inner join: {(3, (4, 9)), (3, (6, 9))}
rightOuterJoin(other) — join, keeping every key of the other (right) RDD: {(3, (4, 9)), (3, (6, 9))}
leftOuterJoin(other) — join, keeping every key of the source (left) RDD: {(1, (2, None)), (3, (4, 9)), (3, (6, 9))}
cogroup(other) — group the data sharing a key from both RDDs: {(1, ([2], [])), (3, ([4, 6], [9]))}
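The two-RDD transformations can be mimicked in plain Python on the same sample data (a sketch of the semantics only; the real operations run distributed across partitions):

```python
# Plain-Python sketch of subtractByKey() and join() semantics on
# rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}.
rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]
other_keys = {k for k, _ in other}

# subtractByKey: drop pairs whose key appears in `other`
subtracted = [(k, v) for k, v in rdd if k not in other_keys]

# join (inner): one output pair per matching key combination
joined = [(k, (v, w)) for k, v in rdd for k2, w in other if k == k2]

print(subtracted)  # [(1, 2)]
print(joined)      # [(3, (4, 9)), (3, (6, 9))]
```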
A pair RDD is still an RDD (its elements are tuples in Python), so it supports all the functions ordinary RDDs support. For example, we can filter a pair RDD, keeping only the elements whose value (the line) is shorter than 20 characters:

result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)

3.1 Aggregation operation

When a dataset is organized as key-value pairs, it is common to aggregate elements with the same key to compute statistics. Spark provides a family of operations that combine values sharing the same key. These operations return RDDs, so they are transformation operations rather than action operations. reduceByKey() runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. The following uses reduceByKey() and mapValues() in Python to compute a running (sum, count) per key, the first step in calculating the average value for each key:

rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))

The result pairs each key with a (sum, count) tuple; dividing the sum by the count (for example, with a further mapValues()) yields the average for each key.
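As a concrete worked example of the mapValues()/reduceByKey() pattern above, sketched in plain Python (the data is made up for illustration):

```python
from collections import defaultdict

# Sketch of rdd.mapValues(lambda x: (x, 1))
#              .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
# on made-up data; the key "pink" has two values, 3 and 7.
data = [("panda", 0), ("pink", 3), ("pirate", 3), ("pink", 7)]

# mapValues step: each value becomes a (value, 1) pair
mapped = [(k, (v, 1)) for k, v in data]

# reduceByKey step: combine the (sum, count) pairs that share a key
acc = defaultdict(lambda: (0, 0))
for k, (s, c) in mapped:
    acc[k] = (acc[k][0] + s, acc[k][1] + c)

# one extra mapValues-style step gives the actual averages
averages = {k: s / c for k, (s, c) in acc.items()}
print(averages)  # {'panda': 0.0, 'pink': 5.0, 'pirate': 3.0}
```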
combineByKey() is the most commonly used function for key-based aggregation; most of the other key-based aggregation functions are implemented with it, and combineByKey() allows the user to return a value whose type differs from that of the input data. To understand combineByKey(), it is necessary to understand how it handles each element as it processes the data: while traversing the elements of a partition, each element's key either has been seen before or has not. For a new key, combineByKey() calls a user-supplied createCombiner() function to create the initial accumulator for that key; for a key already seen in that partition, it calls mergeValue() to merge the new value into the key's accumulator; finally, the accumulators for the same key across different partitions are merged with mergeCombiners().
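This per-element flow can be sketched in plain Python with two lists standing in for two partitions (names and data are illustrative; real combineByKey() runs these three functions distributed):

```python
# Plain-Python sketch of combineByKey() computing per-key averages.
def create_combiner(v):      # first value seen for a key -> (sum, count)
    return (v, 1)

def merge_value(acc, v):     # later value for an already-seen key
    return (acc[0] + v, acc[1] + 1)

def merge_combiners(a, b):   # merge accumulators from different partitions
    return (a[0] + b[0], a[1] + b[1])

data_p1 = [("coffee", 1), ("coffee", 2), ("panda", 3)]  # "partition" 1
data_p2 = [("coffee", 9)]                               # "partition" 2

def combine_partition(part):
    combiners = {}
    for k, v in part:
        if k in combiners:
            combiners[k] = merge_value(combiners[k], v)
        else:
            combiners[k] = create_combiner(v)  # first time key is seen here
    return combiners

c1, c2 = combine_partition(data_p1), combine_partition(data_p2)
merged = dict(c1)
for k, acc in c2.items():
    merged[k] = merge_combiners(merged[k], acc) if k in merged else acc

averages = {k: s / c for k, (s, c) in merged.items()}
print(averages)  # {'coffee': 4.0, 'panda': 3.0}
```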
Parallelism tuning: each RDD has a fixed number of partitions, and the number of partitions determines the degree of parallelism when operations execute on the RDD. Spark provides the repartition() function, which shuffles the data across the network and creates a new set of partitions; repartitioning data is a fairly expensive operation, however. In Python, we can check an RDD's partition count with rdd.getNumPartitions().

3.2 Data grouping

If the data is already keyed the way we want, groupByKey() groups it using the keys in the RDD. For an RDD of keys of type K and values of type V, the resulting RDD is of type [K, Iterable[V]]. groupBy() can be used on unpaired data: it takes a function, applies it to every element, and groups elements whose results are equal.
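A plain-Python sketch of what groupByKey() produces, with each key mapped to the collection of its values (sample data made up):

```python
from collections import defaultdict

# Sketch of groupByKey(): gather all values sharing a key into one list,
# giving the [K, Iterable[V]] shape described above.
data = [(1, 2), (3, 4), (3, 6)]

groups = defaultdict(list)
for k, v in data:
    groups[k].append(v)

print(dict(groups))  # {1: [2], 3: [4, 6]}
```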

3.3 Joins

There are several kinds of joins: right outer join (rightOuterJoin(other)), left outer join (leftOuterJoin(other)), cross join, and inner join; the plain join operator denotes an inner join. In Python, a value that does not exist on one side of an outer join is represented by None.
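The outer-join semantics, including the use of None, can be sketched in plain Python on the earlier sample data (illustration only; Spark performs the join distributed by key):

```python
# Sketch of leftOuterJoin() semantics on
# rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}:
# every key from the left side appears; a missing right value becomes None.
rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]

left_joined = []
for k, v in rdd:
    matches = [w for k2, w in other if k2 == k]
    if matches:
        left_joined.extend((k, (v, w)) for w in matches)
    else:
        left_joined.append((k, (v, None)))  # key absent from the right side

print(left_joined)  # [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
```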

3.4 Data Sorting

If the keys have a defined ordering, a key-value pair RDD can be sorted. Once the data is sorted, subsequent operations such as collect() or save() return ordered results. We can use the sortByKey() function for sorting; the following converts the integer keys to strings and then sorts the RDD using string comparison:

rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
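The effect of the keyfunc above (comparing keys as strings rather than as integers) can be seen in plain Python (sample data made up):

```python
# Sketch of sortByKey(keyfunc=lambda x: str(x)): integer keys are
# compared as strings, so 10 sorts before 2 (because "10" < "2").
data = [(10, "a"), (2, "b"), (1, "c")]

by_string = sorted(data, key=lambda kv: str(kv[0]))  # string comparison
by_int = sorted(data, key=lambda kv: kv[0])          # numeric comparison

print(by_string)  # [(1, 'c'), (10, 'a'), (2, 'b')]
print(by_int)     # [(1, 'c'), (2, 'b'), (10, 'a')]
```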

4 Actions on Pair RDDs

As with the transformation operations, all the traditional actions supported by base RDDs are also available on pair RDDs. Pair RDDs additionally provide some extra actions that take advantage of the key-value nature of the data, including countByKey() (count the number of elements for each key), collectAsMap() (collect the result as a map for easy lookup), and lookup(key) (return all values associated with the given key).
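These pair-RDD actions can be sketched in plain Python on sample data (illustration of the semantics only):

```python
from collections import Counter

# Plain-Python sketch of the pair-RDD actions countByKey(),
# collectAsMap(), and lookup(key) on rdd = {(1, 2), (3, 4), (3, 6)}.
data = [(1, 2), (3, 4), (3, 6)]

count_by_key = Counter(k for k, _ in data)  # countByKey()
as_map = dict(data)                         # collectAsMap(): one value kept per key
lookup_3 = [v for k, v in data if k == 3]   # lookup(3)

print(dict(count_by_key))  # {1: 1, 3: 2}
print(as_map)              # {1: 2, 3: 6}
print(lookup_3)            # [4, 6]
```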


Origin blog.csdn.net/BGoodHabit/article/details/121327791