[Python] PySpark data computation ① (RDD#map method | RDD#map syntax | passing in an ordinary function | passing in a lambda anonymous function | chained calls)





1. RDD#map method




1. Introduction to the RDD#map method


In PySpark, the RDD object provides a data computation method: the RDD#map method;

The RDD#map method applies a function to each element in the RDD data; the applied function

  • can convert each element to another type,
  • or can perform a specified operation on the original elements of the RDD data;

After the computation is completed, a new RDD object is returned;
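
For example, a minimal sketch (assuming sparkContext is an existing SparkContext, created as in the complete examples later in this section):

# map applies the given function to every element and returns a new RDD;
# the original RDD is left unchanged
rdd = sparkContext.parallelize([1, 2, 3])
new_rdd = rdd.map(lambda x: x + 1)
print(new_rdd.collect())  # [2, 3, 4]
print(rdd.collect())      # [1, 2, 3]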


2. RDD#map syntax


The map method, also called the map operator, processes the data elements in the RDD one by one; the processing logic is passed to the map method from the outside, as a function parameter;


RDD#map syntax:

rdd.map(fun)

The fun argument is a function whose type is:

(T) -> U

The part in parentheses before the arrow in the above function type indicates the parameter types of the function,

  • () means the function takes no parameters;
  • (T) means the function takes one parameter;

Here T is a generic type, which means any type; that is to say, the parameter of this function can be of any type;


The U after the arrow (->) in the above function type indicates the return value type of the function.

  • (T) -> U means the parameter type is T and the return value type is U; T and U are both generic and can be the same type or different types;
  • (T) -> T means the parameter can be of any type T, but the return value must be of that same type;

The U type is also generic, which means any type; that is to say, the return value of this function can be of any type;
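
To make this notation concrete, a minimal sketch (assuming rdd is an existing RDD of integers):

# (T) -> T : the parameter and the return value are both int
rdd.map(lambda x: x * 10)

# (T) -> U : the parameter is int, the return value is str -- the element type changes
rdd.map(lambda x: str(x))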


3. RDD#map usage


The RDD#map method accepts a function as a parameter, and that function is applied to each element in the RDD data during the computation;

The following code passes in a lambda anonymous function that multiplies each element in the RDD object by 10;

# Multiply each element in the RDD object by 10
rdd.map(lambda x: x * 10)  
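
Note that map only defines the computation and returns a new RDD object; to actually see the results, an action such as collect() has to be called, as in the complete examples below:

# Trigger the computation and pull the results back to the driver as a Python list
rdd2 = rdd.map(lambda x: x * 10)
print(rdd2.collect())  # [10, 20, 30, 40, 50] when rdd contains [1, 2, 3, 4, 5]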

4. Code example - RDD#map numerical computation (passing in an ordinary function)


In the code below,

First, create an RDD containing integers,

# Create an RDD containing integers
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

Then, use the map() method to multiply each element by 10;

# Function executed for each element
def func(element):
    return element * 10


# Apply the map operation to multiply each element by 10
rdd2 = rdd.map(func)

Finally, print the contents of the new RDD;

# Print the contents of the new RDD
print(rdd2.collect())

Code example:

"""
PySpark 数据处理
"""

# 导入 PySpark 相关包
from pyspark import SparkConf, SparkContext
# 为 PySpark 配置 Python 解释器
import os
os.environ['PYSPARK_PYTHON'] = "Y:/002_WorkSpace/PycharmProjects/pythonProject/venv/Scripts/python.exe"

# 创建 SparkConf 实例对象 , 该对象用于配置 Spark 任务
# setMaster("local[*]") 表示在单机模式下 本机运行
# setAppName("hello_spark") 是给 Spark 程序起一个名字
sparkConf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("hello_spark")

# 创建 PySpark 执行环境 入口对象
sparkContext = SparkContext(conf=sparkConf)

# 打印 PySpark 版本号
print("PySpark 版本号 : ", sparkContext.version)

# 创建一个包含整数的 RDD
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])


# 为每个元素执行的函数
def func(element):
    return element * 10


# 应用 map 操作,将每个元素乘以 10
rdd2 = rdd.map(func)

# 打印新的 RDD 中的内容
print(rdd2.collect())

# 停止 PySpark 程序
sparkContext.stop()

Execution results:

Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Scripts\python.exe Y:/002_WorkSpace/PycharmProjects/HelloPython/hello.py
23/07/30 21:39:59 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/30 21:39:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark 版本号 :  3.4.1
[10, 20, 30, 40, 50]

Process finished with exit code 0



5. Code example - RDD#map numerical computation (passing in a lambda anonymous function)


In the code below,

First, create an RDD containing integers,

# Create an RDD containing integers
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

Then, use the map() method to multiply each element by 10; here a lambda function is passed in as the parameter, which accepts an integer parameter element and returns element * 10;

# Apply the map operation to multiply each element by 10
rdd2 = rdd.map(lambda element: element * 10)

Finally, print the contents of the new RDD;

# Print the contents of the new RDD
print(rdd2.collect())

Code example:

"""
PySpark 数据处理
"""

# 导入 PySpark 相关包
from pyspark import SparkConf, SparkContext
# 为 PySpark 配置 Python 解释器
import os
os.environ['PYSPARK_PYTHON'] = "Y:/002_WorkSpace/PycharmProjects/pythonProject/venv/Scripts/python.exe"

# 创建 SparkConf 实例对象 , 该对象用于配置 Spark 任务
# setMaster("local[*]") 表示在单机模式下 本机运行
# setAppName("hello_spark") 是给 Spark 程序起一个名字
sparkConf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("hello_spark")

# 创建 PySpark 执行环境 入口对象
sparkContext = SparkContext(conf=sparkConf)

# 打印 PySpark 版本号
print("PySpark 版本号 : ", sparkContext.version)

# 创建一个包含整数的 RDD
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

# 应用 map 操作,将每个元素乘以 10
rdd2 = rdd.map(lambda element: element * 10)

# 打印新的 RDD 中的内容
print(rdd2.collect())

# 停止 PySpark 程序
sparkContext.stop()

Execution results:

Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Scripts\python.exe Y:/002_WorkSpace/PycharmProjects/HelloPython/hello.py
23/07/30 21:46:53 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/30 21:46:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark 版本号 :  3.4.1
[10, 20, 30, 40, 50]

Process finished with exit code 0



6. Code example - RDD#map numerical computation (chained calls)


In the following code, each element of the RDD object is first multiplied by 10, then 5 is added to each result, and finally each value is divided by 2; the whole process is completed with chained calls, in a functional-programming style;

The core code is as follows:

# Create an RDD containing integers
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

# Apply map operations: multiply each element by 10, then add 5, then divide by 2
rdd2 = rdd.map(lambda element: element * 10)\
    .map(lambda element: element + 5)\
    .map(lambda element: element / 2)

# Print the contents of the new RDD
print(rdd2.collect())
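
The three chained map calls are applied in order, so each element ends up as (element * 10 + 5) / 2; an equivalent single map call, shown here only for comparison, would be:

# Equivalent single map call: compute (x * 10 + 5) / 2 for each element
rdd2 = rdd.map(lambda element: (element * 10 + 5) / 2)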

Code example:

"""
PySpark 数据处理
"""

# 导入 PySpark 相关包
from pyspark import SparkConf, SparkContext
# 为 PySpark 配置 Python 解释器
import os
os.environ['PYSPARK_PYTHON'] = "Y:/002_WorkSpace/PycharmProjects/pythonProject/venv/Scripts/python.exe"

# 创建 SparkConf 实例对象 , 该对象用于配置 Spark 任务
# setMaster("local[*]") 表示在单机模式下 本机运行
# setAppName("hello_spark") 是给 Spark 程序起一个名字
sparkConf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("hello_spark")

# 创建 PySpark 执行环境 入口对象
sparkContext = SparkContext(conf=sparkConf)

# 打印 PySpark 版本号
print("PySpark 版本号 : ", sparkContext.version)

# 创建一个包含整数的 RDD
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

# 应用 map 操作,将每个元素乘以 10
rdd2 = rdd.map(lambda element: element * 10)\
    .map(lambda element: element + 5)\
    .map(lambda element: element / 2)

# 打印新的 RDD 中的内容
print(rdd2.collect())

# 停止 PySpark 程序
sparkContext.stop()

Execution results:

Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Scripts\python.exe Y:/002_WorkSpace/PycharmProjects/HelloPython/hello.py
23/07/30 21:50:29 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/30 21:50:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark 版本号 :  3.4.1
[7.5, 12.5, 17.5, 22.5, 27.5]

Process finished with exit code 0

