[Python] PySpark data computation ④ ( RDD#filter method - filter elements in an RDD | RDD#distinct method - deduplicate elements in an RDD )





1. RDD#filter method




1. Introduction to the RDD#filter method


The RDD#filter method filters the elements of an RDD according to a specified condition and returns a new RDD object.

The RDD#filter method does not modify the data of the original RDD.

Usage:

new_rdd = old_rdd.filter(func)

In the above code:

  • old_rdd is the original RDD object;
  • the func argument passed to the filter method is a regular function or a lambda (anonymous function) that defines the filter condition:
    • if func returns True, the element is kept;
    • if func returns False, the element is removed;
  • new_rdd is the filtered RDD object.
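
Because filter produces a new RDD, the original RDD can still be used afterwards. A minimal sketch of this behavior (assuming an existing SparkContext named sc, created as in the full example later in this section):

# filter returns a new RDD; the original RDD is left unchanged
old_rdd = sc.parallelize([1, 2, 3, 4, 5])
new_rdd = old_rdd.filter(lambda x: x > 2)

print(new_rdd.collect())   # [3, 4, 5]
print(old_rdd.collect())   # [1, 2, 3, 4, 5] - the original data is intact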

2. RDD#filter method syntax


RDD#filter method syntax:

rdd.filter(func)

The method accepts a function as a parameter that defines the filter condition: elements that satisfy the condition are kept, and elements that do not are removed.

The following describes the type requirements for the func parameter of the filter method.


func function type description:

(T) -> bool

The func argument passed to the filter method accepts an element of any type T as its parameter and returns a boolean value indicating whether the element should be kept in the new RDD:

  • returning True keeps the element;
  • returning False removes the element.
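
The predicate does not have to be a lambda; any function whose type matches (T) -> bool works. A minimal sketch (again assuming an existing SparkContext sc; the word list and the is_long_word helper are made up for illustration):

# A named predicate function can be passed to filter just like a lambda
def is_long_word(word):
    # Return True to keep the element, False to drop it
    return len(word) > 3

words_rdd = sc.parallelize(["spark", "rdd", "filter", "map"])
long_words = words_rdd.filter(is_long_word)

print(long_words.collect())   # ['spark', 'filter']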


3. Code example - RDD#filter method


The core of the code below is:

# Create an RDD containing integers
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Use the filter method to keep the even numbers and remove the odd numbers
even_numbers = rdd.filter(lambda x: x % 2 == 0)

# Print the filtered result
print(even_numbers.collect())

In the above code, the original RDD contains the integers 1 through 9.

The lambda anonymous function lambda x: x % 2 == 0 receives each number and:

  • returns True if the number is even, keeping the element;
  • returns False if the number is odd, removing the element.

Code example:

"""
PySpark 数据处理
"""

# 导入 PySpark 相关包
from pyspark import SparkConf, SparkContext
# 为 PySpark 配置 Python 解释器
import os
os.environ['PYSPARK_PYTHON'] = "Y:/002_WorkSpace/PycharmProjects/pythonProject/venv/Scripts/python.exe"

# 创建 SparkConf 实例对象 , 该对象用于配置 Spark 任务
# setMaster("local[*]") 表示在单机模式下 本机运行
# setAppName("hello_spark") 是给 Spark 程序起一个名字
sparkConf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("hello_spark")

# 创建 PySpark 执行环境 入口对象
sc = SparkContext(conf=sparkConf)

# 打印 PySpark 版本号
print("PySpark 版本号 : ", sc.version)

# 创建一个包含整数的 RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 使用 filter 方法过滤出偶数, 删除奇数
even_numbers = rdd.filter(lambda x: x % 2 == 0)

# 输出过滤后的结果
print(even_numbers.collect())

# 停止 PySpark 程序
sc.stop()

Execution result:

Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Scripts\python.exe Y:/002_WorkSpace/PycharmProjects/HelloPython/hello.py
23/08/02 21:07:55 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/02 21:07:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark version :  3.4.1
[2, 4, 6, 8]

Process finished with exit code 0






2. RDD#distinct method




1. Introduction to the RDD#distinct method


The RDD#distinct method deduplicates the data in an RDD and returns a new RDD object.

The RDD#distinct method does not modify the original RDD object.


To use it, call the distinct method on the RDD object directly, without passing any arguments:

new_rdd = old_rdd.distinct()

In the above code, old_rdd is the original RDD object, and new_rdd is the new RDD object with duplicate elements removed.
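
distinct is not limited to numbers; any hashable element type can be deduplicated. A minimal sketch (assuming an existing SparkContext sc, as in the full example below; the fruit list is made up for illustration). Note that distinct shuffles the data, so the order of the returned elements is not guaranteed:

# Deduplicate a list of strings
words_rdd = sc.parallelize(["apple", "banana", "apple", "cherry", "banana"])
unique_words = words_rdd.distinct()

# The result order may differ between runs because distinct involves a shuffle
print(unique_words.collect())   # e.g. ['banana', 'apple', 'cherry']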


2. Code example - RDD#distinct method


Code example:

"""
PySpark 数据处理
"""

# 导入 PySpark 相关包
from pyspark import SparkConf, SparkContext
# 为 PySpark 配置 Python 解释器
import os
os.environ['PYSPARK_PYTHON'] = "Y:/002_WorkSpace/PycharmProjects/pythonProject/venv/Scripts/python.exe"

# 创建 SparkConf 实例对象 , 该对象用于配置 Spark 任务
# setMaster("local[*]") 表示在单机模式下 本机运行
# setAppName("hello_spark") 是给 Spark 程序起一个名字
sparkConf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("hello_spark")

# 创建 PySpark 执行环境 入口对象
sc = SparkContext(conf=sparkConf)

# 打印 PySpark 版本号
print("PySpark 版本号 : ", sc.version)

# 创建一个包含整数的 RDD 对象
rdd = sc.parallelize([1, 1, 2, 2, 3, 3, 3, 4, 4, 5])

# 使用 distinct 方法去除 RDD 对象中的重复元素
distinct_numbers = rdd.distinct()

# 输出去重后的结果
print(distinct_numbers.collect())

# 停止 PySpark 程序
sc.stop()

Execution result:

Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Scripts\python.exe Y:/002_WorkSpace/PycharmProjects/HelloPython/hello.py
23/08/02 21:16:35 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/02 21:16:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark version :  3.4.1
Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\shuffle.py:65: UserWarning: Please install psutil to have better support with spilling
Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\shuffle.py:65: UserWarning: Please install psutil to have better support with spilling
Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\shuffle.py:65: UserWarning: Please install psutil to have better support with spilling
Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\shuffle.py:65: UserWarning: Please install psutil to have better support with spilling
Y:\002_WorkSpace\PycharmProjects\pythonProject\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\shuffle.py:65: UserWarning: Please install psutil to have better support with spilling
[1, 2, 3, 4, 5]

Process finished with exit code 0
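
The repeated UserWarning lines in the output indicate that the optional psutil package is not installed in the virtual environment; installing it (for example with pip install psutil) should give PySpark better memory monitoring when it spills data to disk during the shuffle that distinct triggers, and should silence the warning. The deduplicated result [1, 2, 3, 4, 5] is printed on the last line before the program exits.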



Origin blog.csdn.net/han1202012/article/details/132071321