PySpark Basic Introduction 4: RDD Transformation and Action Operators


Note: If you find this blog helpful, please like and bookmark it. Every week I publish content on artificial intelligence and big data, most of it original: Python, Java, Scala, and SQL code; CV, NLP, and recommender systems; Spark, Flink, Kafka, HBase, Hive, Flume, and more, plus breakdowns of papers from top conferences. Let's make progress together.
Today we continue with PySpark Basic Introduction 4.



Foreword

Today I will share the operations related to Spark RDD operators.
An RDD operator is one of the many special-purpose functions provided by an RDD object; we usually call these functions operators (in plain terms, they are the RDD API).


1. Classification of RDD operators

RDD operators fall into two categories: Transformation (transformation operators) and Action (action operators).
Transformation operators:
1- Every transformation operator returns a new RDD when it finishes.
2- All transformation operators are LAZY: they are not executed immediately. At this stage they can be thought of as defining the RDD's computation rules (see the sketch after this list).
3- A transformation only runs when it is triggered by an action operator.

Action operators:
1- An action operator does not return an RDD: it either has no return value or returns something else (a number, a list, a dict, and so on).
2- Action operators execute immediately. Each action operator generates a Job that executes the task and runs all the RDDs this action depends on.
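The following minimal sketch (written in the same style as the demos below; the app name is only illustrative) shows both behaviors: chaining transformations builds new RDDs without touching the data, and only the final action launches a Job that runs the whole chain.

from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("lazy_demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5])

    # transformations only record the computation rules; nothing runs yet
    rdd_plan = rdd_init.map(lambda num: num * 10).filter(lambda num: num > 20)

    # the action triggers a Job that executes the whole chain
    print(rdd_plan.collect())  # [30, 40, 50]
    sc.stop()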

2. Transformation operators


1. map operator

  • Format: rdd.map(fn)
  • Description: Applies the function passed in to each element, a one-to-one transformation: one element goes in, one element comes out
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo1").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_map = rdd_init.map(lambda num: num + 1)
    rdd_res = rdd_map.collect()
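    # collect() triggers the job; expected result: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]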
    print(rdd_res)
    sc.stop()

2. groupBy operator

  • Format: groupBy(fn)
  • Description: Groups the data by the key returned by the function passed in
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo2").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

    # def jo(num):
    #     if num % 2 == 0:
    #         return 'o'
    #     else:
    #         return 'j'

    rdd_group_by = rdd_init.groupBy(lambda num: 'o' if num % 2 == 0 else 'j')
    rdd_res = rdd_group_by.mapValues(list).collect()
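    # groupBy returns (key, iterable) pairs; mapValues(list) makes the groups printable
    # expected result (group order may vary): [('j', [1, 3, 5, 7, 9]), ('o', [2, 4, 6, 8, 10])]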
    print(rdd_res)
    sc.stop()

3. filter operator

  • Format: filter(fn)
  • Description: Filters the data according to the condition in the function passed in: elements for which it returns True are kept, elements for which it returns False are filtered out
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo3").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.filter(lambda num: num > 3).collect()
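    # expected result: [4, 5, 6, 7, 8, 9, 10]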
    print(rdd_res)
    sc.stop()

4. flatMap operator

  • Format: flatMap(fn)
  • Description: Like map, but with an extra flattening step; mainly used when a single element contains multiple items, producing a one-to-many transformation
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo4").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize(['张三 李四 王五 赵六', '田七 周八 李九'])
    rdd_res = rdd_init.flatMap(lambda line: line.split(' ')).collect()
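    # each line is split into words and the results are flattened into one list:
    # ['张三', '李四', '王五', '赵六', '田七', '周八', '李九']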
    print(rdd_res)
    sc.stop()

5. union and intersection operators

  • Format: rdd1.union(rdd2) / rdd1.intersection(rdd2)
  • Description: union merges the two RDDs and keeps duplicates (chain distinct() to remove them); intersection keeps only the elements that appear in both RDDs

from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo5").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd1 = sc.parallelize([3, 1, 5, 7, 9])
    rdd2 = sc.parallelize([5, 8, 2, 4, 0])

    # rdd_res = rdd1.union(rdd2).collect()
    # rdd_res = rdd1.union(rdd2).distinct().collect()
    rdd_res = rdd1.intersection(rdd2).collect()
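    # intersection keeps only elements present in both RDDs -> [5]
    # union (commented out above) keeps duplicates unless distinct() is chained after it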
    print(rdd_res)
    sc.stop()

6. groupByKey operator

  • Format: groupByKey()
  • Description: Groups (key, value) pairs by key
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo6").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c01', '张三'), ('c02', '李四'), ('c02', '王五'),
                               ('c02', '赵六'), ('c03', '田七'), ('c03', '周八')])

    rdd_res = rdd_init.groupByKey().mapValues(list).collect()
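    # expected result (key order may vary):
    # [('c01', ['张三']), ('c02', ['李四', '王五', '赵六']), ('c03', ['田七', '周八'])]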
    print(rdd_res)
    sc.stop()

7. reduceByKey operator

  • Format: reduceByKey(fn)
  • Description: Groups the data by key, gathers each key's values, and aggregates them with the function passed in
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo7").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c01', '张三'), ('c02', '李四'), ('c02', '王五'),
                               ('c02', '赵六'), ('c03', '田七'), ('c03', '周八')])

    rdd_res = rdd_init.map(lambda kv: (kv[0], 1)).reduceByKey(lambda agg, curr: agg + curr).collect()
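    # each pair is first mapped to (key, 1), then the 1s are summed per key
    # expected result (key order may vary): [('c01', 1), ('c02', 3), ('c03', 2)]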
    print(rdd_res)
    sc.stop()

8. sortByKey operator

  • Format: sortByKey(ascending=True|False)
  • Description: Sorts by key. By default keys are sorted in ascending order; set ascending=False for descending order
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo8").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c03', '张三'), ('c04', '李四'), ('c05', '王五'),
                               ('c01', '赵六'), ('c07', '田七'), ('c08', '周八')])

    rdd_res = rdd_init.sortByKey(ascending=False).collect()
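    # keys sorted in descending order: c08, c07, c05, c04, c03, c01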
    print(rdd_res)
    sc.stop()

9. countByKey and countByValue operators

  • countByKey(): groups by key and counts how many elements each key has
  • countByValue(): counts how many times each value appears; for an RDD of (key, value) pairs the whole pair is treated as the value
  • Note: strictly speaking these two are action operators, since they return a Python dict to the driver rather than a new RDD
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo9").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c01', '张三'), ('c02', '李四'), ('c02', '王五'),
                               ('c02', '赵六'), ('c03', '田七'), ('c03', '周八'), ('c01', '张三')
                               ])

    rdd_res0 = rdd_init.countByKey()
    rdd_res1 = rdd_init.countByValue()
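    # countByKey -> defaultdict with {'c01': 2, 'c02': 3, 'c03': 2}
    # countByValue counts whole (key, value) pairs, so ('c01', '张三') is counted twice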

    print(rdd_res0)
    print(rdd_res1)
    sc.stop()

3. Action operators


1. reduce operator

  • Format: reduce(fn)
  • Description: Aggregates the elements according to the function passed in
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo1").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.reduce(lambda agg, curr: agg + curr)
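    # sums 1..10 -> 55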
    print(rdd_res)
    sc.stop()

2. first operator

  • Format: first()
  • Description: Get the first element
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo2").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.first()
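    # expected result: 1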
    print(rdd_res)
    sc.stop()

3. take operator

  • Format: take(N)
  • Description: Get the first N elements, similar to the limit operation
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo3").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.take(3)
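    # expected result: [1, 2, 3]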
    print(rdd_res)
    sc.stop()

4. top operator

  • Format: top(N, [fn])
  • Description: Sorts the dataset in descending order and returns the first N elements; for (key, value) data it sorts by key by default
  • fn: optional function that customizes which field to sort by
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo4").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c03', 10), ('c04', 30), ('c05', 20),
                               ('c01', 20), ('c07', 80), ('c08', 5)])

    rdd_res = rdd_init.top(3, lambda kv: kv[1])
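    # top 3 by value: [('c07', 80), ('c04', 30), ...]; the third element is one of the two pairs whose value is 20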
    print(rdd_res)
    sc.stop()

5. count operator

  • Format: count()
  • Description: Returns the number of elements in the RDD
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo5").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.count()
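    # expected result: 10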
    print(rdd_res)
    sc.stop()

6. foreach operator

  • Format: foreach(fn)
  • Description: Iterates over the dataset; what is done with each element depends on the function passed in. foreach has no return value
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo6").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([('c03', 10), ('c04', 30), ('c05', 20),
                               ('c01', 20), ('c07', 80), ('c08', 5)])

    rdd_res = rdd_init.foreach(lambda kv: print(kv))
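    # foreach returns None (so the print below shows None); the lambda's print output
    # comes from the executors and is visible in the console in local mode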
    print(rdd_res)
    sc.stop()

7. takeSample operator

  • Format: takeSample(withReplacement, num, seed)

    • Parameter 1: whether sampling with replacement is allowed (the same element may be drawn more than once)
    • Parameter 2: how many elements to sample; with replacement the sample size is unlimited, otherwise it is at most the number of elements in the RDD
    • Parameter 3: optional seed value; if a fixed seed is given, every run returns the same sample. Usually left unset unless reproducible sampling is needed
  • Role: data sampling

from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("demo6").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd_init = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    rdd_res = rdd_init.takeSample(withReplacement=True, num=5, seed=1)
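    # returns 5 elements; duplicates are possible because withReplacement=True,
    # and the fixed seed makes the sample the same on every run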
    print(rdd_res)
    sc.stop()

Summary

Today I mainly covered the transformation and action operators of RDDs; next time I will continue with some of RDD's other important operators.
