PySpark data manipulation

Data input

RDD object

PySpark supports many kinds of data input. Once the input is complete, you obtain an object of the RDD class.

RDD stands for Resilient Distributed Dataset.

PySpark uses RDD objects as the carrier for data processing, namely:

  • Data is stored inside an RDD
  • The various methods for computing on that data are member methods of the RDD
  • Those computation methods return a new RDD object

The PySpark programming model can therefore be summarized as:

  • Prepare data into an RDD -> iterate with RDD computations -> export the RDD as a list, a text file, etc.
  • In short: source data -> RDD -> result data
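
As a minimal end-to-end sketch of this model (each step, including interpreter setup on Windows, is covered in the sections that follow; the numbers here are just placeholders):

from pyspark import SparkConf, SparkContext

# Build the entry object, compute on the RDD, then export the result back to Python
conf = SparkConf().setMaster("local[*]").setAppName("demo")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3])      # source data -> RDD
result = rdd.map(lambda x: x * 2)    # RDD iterative calculation
print(result.collect())              # RDD -> result data, e.g. [2, 4, 6]
sc.stop()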

Python data container to RDD object

The parallelize member method of the SparkContext object converts a Python:

  • list
  • tuple
  • set
  • dict
  • str

into a PySpark RDD object.

Note:

  • A string is split into individual characters, each stored as an element of the RDD
  • For a dictionary, only the keys are stored in the RDD

Read file to RDD object

PySpark also supports reading files through the SparkContext entry object to construct RDD objects.

Summary:

1. What is an RDD object? Why use it?

An RDD (Resilient Distributed Dataset) object is the carrier of data computation in PySpark. It:

  • stores the data
  • provides the various methods for computing on that data
  • returns a new RDD from every computation method, enabling iterative RDD computation

All subsequent computation on the data is performed through RDD objects.

2. How to input data into Spark (i.e., obtain an RDD object)

  • Convert a Python data container into an RDD object with SparkContext's parallelize member method
  • Read a text file into an RDD object with SparkContext's textFile member method

Data calculation

map method

All of PySpark's data computations are performed on RDD objects. How? Through the rich set of member methods (operators) built into the RDD class.

Syntax: rdd.map(func), where func takes one element (T) and returns a new element (U).

"""
 演示PySpark代码加载数据即数据输入
"""
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

# # 通过parallelize方法将Python对象加载到Spark内,成为RDD对象
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize((1, 2, 3, 4, 5))
rdd3 = sc.parallelize("abcdefg")
rdd4 = sc.parallelize({1, 2, 3, 4, 5})
rdd5 = sc.parallelize({"key1": "value1", "key2": "value2", "key3": "value3"})

# # 如果要查看RDD里边有什么内容,需要用collect()方法
print(rdd1.collect())
print(rdd2.collect())
print(rdd3.collect())
print(rdd4.collect())
print(rdd5.collect())
# 用过textFile方法,读取文件数据加载到Spark内,成为RDD对象
rdd = sc.textFile("E:/百度网盘/1、Python快速入门(8天零基础入门到精通)/资料/第15章资料/资料/hello.txt")
print(rdd.collect())
sc.stop()
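
The block above is the data-input demo; the original map example is not reproduced here, so what follows is a minimal sketch of map usage (assuming the same local setup), including a chained call:

"""
Minimal sketch of the map operator (not the original example; assumes the same local setup).
"""
from pyspark import SparkConf, SparkContext
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"  # adjust to your interpreter

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
# map applies the function to every element and returns a new RDD
rdd2 = rdd.map(lambda x: x * 10)
print(rdd2.collect())  # [10, 20, 30, 40, 50]

# Chained call: each map returns a new RDD, so operators can be strung together
rdd3 = rdd.map(lambda x: x * 10).map(lambda x: x + 5)
print(rdd3.collect())  # [15, 25, 35, 45, 55]
sc.stop()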

Summary:

 1. map operator (member method)

  • Accepts a processing function, which can be written concisely as a lambda expression
  • Processes the elements of the RDD one by one and returns a new RDD

2. Chained calls

  • Because such operators return a new RDD, they can be called repeatedly in a chain.

flatMap method

Function: compute on each element just like map, then flatten one level of nesting out of the results (an example is sketched after the summary below).

Summary:

flatMap operator

  • The computation logic is the same as map
  • In addition to what map does, it un-nests (flattens) one level of the results
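
The original example code is not reproduced here; a minimal sketch of flatMap splitting lines into words, assuming the same local setup as the earlier examples:

"""
Minimal sketch of the flatMap operator (assumes the same local setup as above).
"""
from pyspark import SparkConf, SparkContext
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"  # adjust to your interpreter

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(["itheima itcast 666", "itheima itheima itcast", "python itheima"])
# map would give a list of lists; flatMap flattens that one level of nesting
rdd2 = rdd.flatMap(lambda line: line.split(" "))
print(rdd2.collect())  # ['itheima', 'itcast', '666', 'itheima', ...]
sc.stop()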

reduceByKey method

 

"""
PySpark代码加载数据reduceByKey方法
针对KV型 RDD
自动按照key分组,然后根据你提供的聚合逻辑完成组内数(value)的聚合操作.

二元元祖
"""

from pyspark import SparkConf, SparkContext
# Configure the Python interpreter
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([('男', 99), ('女', 88), ('女', 99), ('男', 77), ('男', 55)])
# Requirement: sum the scores of the male ('男') and female ('女') groups
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())

 

Summary:

reduceByKey operator

  • Accepts a processing function and applies it pairwise to the values within each key group to aggregate them
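
To make the pairwise aggregation concrete, here is a plain-Python illustration (not Spark code) of how the lambda above folds the values that share the key '男':

from functools import reduce

# reduceByKey applies the function pairwise within each key group;
# for '男' the values are [99, 77, 55], folded as (99 + 77) + 55
print(reduce(lambda a, b: a + b, [99, 77, 55]))  # 231, the same total Spark computes for '男'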

 

Practice Case 1

WordCount case

Using what you have learned so far, complete the following:

  • Read the file
  • Count how many times each word occurs in the file

hello.txt

itheima itheima itcast itheima
spark python spark python itheima
itheima itcast itcast itheima python
python python spark pyspark pyspark
itheima python pyspark itcast spark

"""
完成练习案例:单词计数统计
"""
# 1.构建执行环境入口对象
from pyspark import SparkConf, SparkContext
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)
# 2.读取数据
rdd = sc.textFile("E:/百度网盘/1、Python快速入门(8天零基础入门到精通)/资料/第15章资料/资料/hello.txt")
# 3.取出全部单词
wor_rdd = rdd.flatMap(lambda a: a.split(" "))
# print(wor_rdd.collect())
# 4.将所有单词都转换成二元元组,单词为key,Value设置为1
word_with_one_rdd = wor_rdd.map(lambda word: (word, 1))
# print(word_with_one_rdd.collect())
# 5.分组并求和
result = word_with_one_rdd.reduceByKey(lambda a, b: a + b)
# 6.打印输出结果
print(result.collect())


Result:
[('python', 6), ('itheima', 7), ('itcast', 4), ('spark', 4), ('pyspark', 3)]

filter method

Function: keep only the data you want, filtering the rest out

"""
PySpark代码加载数据Filter方法
"""
from pyspark import SparkConf, SparkContext
# 配置Python解释器
import os

os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

# 准备一个RDD
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
# 对RDD的数据进行过滤
rdd2 = rdd.filter(lambda num: num % 2 == 0)  # 整数返回true 奇数返回false

print(rdd2.collect())


Result:
[2, 4, 6]

Summary:

filter operator

  • Accepts a processing function, which can be written concisely as a lambda
  • The function is applied to each RDD element one by one; only elements for which it returns True are kept in the resulting RDD

distinct method

Function: deduplicate RDD data and return a new RDD

Syntax:

        rdd.distinct()  # no parameters needed

"""
PySpark代码加载数据distinct方法
去重  无需传参
"""
from pyspark import SparkConf, SparkContext
# 配置Python解释器
import os

os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

# 准备一个RDD
rdd = sc.parallelize([1, 1, 2, 3, 3, 3, 5, 6, 7, 7, 7, 7, 7])
# 对RDD的数据进行去重
rdd2 = rdd.distinct()
print(rdd2.collect())


Result:
[1, 2, 3, 5, 6, 7]

Summary:

distinct operator

  • Deduplicates the data in the RDD

sortBy method

Function: sort the RDD's data according to a sort key that you specify

"""
PySpark代码加载数据sortBy方法
排序
语法:
rdd.sortBy(func,ascending=False, numPartitions=1)
# func:(T)U:告知按照rdd中的哪个数据进行排序,
比如lambda x:x[1]表示按照rdd中的第二列元素进行排序
# ascending True升序  False降序
# numPartitions:用多少分区排序
"""
# 1.构建执行环境入口对象
from pyspark import SparkConf, SparkContext
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)
# 2.读取数据
rdd = sc.textFile("E:/百度网盘/1、Python快速入门(8天零基础入门到精通)/资料/第15章资料/资料/hello.txt")
# 3.取出全部单词
wor_rdd = rdd.flatMap(lambda a: a.split(" "))
# 4.将所有单词都转换成二元元组,单词为key,Value设置为1
word_with_one_rdd = wor_rdd.map(lambda word: (word, 1))
# 5.分组并求和
result = word_with_one_rdd.reduceByKey(lambda a, b: a + b)
# 6.打印输出结果
print(result.collect())
# 7.对结果进行排序
a = result.sortBy(lambda x: x[1], ascending=False, numPartitions=1)  # 降序
print(a.collect())

b = result.sortBy(lambda x: x[1], ascending=True, numPartitions=1)  # 升序
print(b.collect())


Result:
[('python', 6), ('itheima', 7), ('itcast', 4), ('spark', 4), ('pyspark', 3)]
[('itheima', 7), ('python', 6), ('itcast', 4), ('spark', 4), ('pyspark', 3)]
[('pyspark', 3), ('itcast', 4), ('spark', 4), ('python', 6), ('itheima', 7)]

Summary:

sortBy operator

  • Accepts a processing function, which can be written concisely as a lambda
  • The function specifies the sort key
  • Ascending or descending order can be chosen
  • For a global sort, set the number of partitions to 1

Practice Case 2

Case data (contents of orders.txt):

{"id":1,"timestamp":"2019-05-08T01:03.00Z","category":"平板电脑","areaName":"北京","money":"1450"}|{"id":2,"timestamp":"2019-05-08T01:01.00Z","category":"手机","areaName":"北京","money":"1450"}|{"id":3,"timestamp":"2019-05-08T01:03.00Z","category":"手机","areaName":"北京","money":"8412"} {"id":4,"timestamp":"2019-05-08T05:01.00Z","category":"电脑","areaName":"上海","money":"1513"}|{"id":5,"timestamp":"2019-05-08T01:03.00Z","category":"家电","areaName":"北京","money":"1550"}|{"id":6,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"杭州","money":"1550"} {"id":7,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"北京","money":"5611"}|{"id":8,"timestamp":"2019-05-08T03:01.00Z","category":"家电","areaName":"北京","money":"4410"}|{"id":9,"timestamp":"2019-05-08T01:03.00Z","category":"家具","areaName":"郑州","money":"1120"} {"id":10,"timestamp":"2019-05-08T01:01.00Z","category":"家具","areaName":"北京","money":"6661"}|{"id":11,"timestamp":"2019-05-08T05:03.00Z","category":"家具","areaName":"杭州","money":"1230"}|{"id":12,"timestamp":"2019-05-08T01:01.00Z","category":"书籍","areaName":"北京","money":"5550"} {"id":13,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"5550"}|{"id":14,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"1261"}|{"id":15,"timestamp":"2019-05-08T03:03.00Z","category":"电脑","areaName":"杭州","money":"6660"} {"id":16,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"天津","money":"6660"}|{"id":17,"timestamp":"2019-05-08T01:03.00Z","category":"书籍","areaName":"北京","money":"9000"}|{"id":18,"timestamp":"2019-05-08T05:01.00Z","category":"书籍","areaName":"北京","money":"1230"} {"id":19,"timestamp":"2019-05-08T01:03.00Z","category":"电脑","areaName":"杭州","money":"5551"}|{"id":20,"timestamp":"2019-05-08T01:01.00Z","category":"电脑","areaName":"北京","money":"2450"} {"id":21,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"5520"}|{"id":22,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"6650"} {"id":23,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"1240"}|{"id":24,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"天津","money":"5600"} {"id":25,"timestamp":"2019-05-08T01:03.00Z","category":"食品","areaName":"北京","money":"7801"}|{"id":26,"timestamp":"2019-05-08T01:01.00Z","category":"服饰","areaName":"北京","money":"9000"} {"id":27,"timestamp":"2019-05-08T01:03.00Z","category":"服饰","areaName":"杭州","money":"5600"}|{"id":28,"timestamp":"2019-05-08T01:01.00Z","category":"食品","areaName":"北京","money":"8000"}|{"id":29,"timestamp":"2019-05-08T02:03.00Z","category":"Apparel","areaName":"Hangzhou","money":"7000"}

Requirements: copy the above content into a file, then use Spark to read the file and compute:

  • The sales ranking of each city, from largest to smallest
  • Across all cities, which product categories are on sale
  • Which product categories are on sale in Beijing
"""
使用Spark读取文件进行计算:
各个城市销售额排名,从大到小
全部城市,有哪些商品类别在售卖
北京市有哪些商品类别在售卖
"""
import json
from pyspark import SparkConf, SparkContext
import os
os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"
conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)


# TODO Requirement 1: city sales ranking
# 1.1 Read the file into an RDD
rdd = sc.textFile("E:/百度网盘/1、Python快速入门(8天零基础入门到精通)/资料/第15章资料/资料/orders.txt")
# print(rdd.collect())

# 1.2 Extract the individual JSON strings
json_str = rdd.flatMap(lambda a: a.split("|"))
# print(json_str.collect())

# 1.3 Convert each JSON string into a dictionary
my_dict = json_str.map(lambda x: json.loads(x))
# print(my_dict.collect())

# 1.4 Extract the city and sales data
# (city, sales)
city_with_money_rdd = my_dict.map(lambda x: (x["areaName"], int(x['money'])))
# print(city_with_money_rdd.collect())

# 1.5 Group by city and aggregate the sales
city_result = city_with_money_rdd.reduceByKey(lambda a, b: a + b)
# print(city_result.collect())

# 1.6 Sort the aggregated result by sales
sorting = city_result.sortBy(lambda a: a[1], ascending=False, numPartitions=1)
print(f"Result of requirement 1: {sorting.collect()}")

# TODO Requirement 2: across all cities, which product categories are on sale
category_rdd = my_dict.map(lambda x: x['category']).distinct()
print(f"Result of requirement 2: {category_rdd.collect()}")

# TODO Requirement 3: which product categories are on sale in Beijing
# 3.1 Filter the data for Beijing
bj_dict = my_dict.filter(lambda a: a["areaName"] == "北京")
# print(bj_dict.collect())

# # 3.2 Extract the full list of categories
# bj_category = bj_dict.map(lambda a: a["category"])
# print(bj_category.collect())

# # 3.3 Deduplicate the categories
# bj_category_distinct = bj_category.distinct()
# print(f"Categories on sale in Beijing: {bj_category_distinct.collect()}")

# 3.2 Extract all categories and deduplicate them in one chained call
bj_category = bj_dict.map(lambda a: a["category"]).distinct()
print(f"Categories on sale in Beijing: {bj_category.collect()}")



Result:
Result of requirement 1: [('北京', 91556), ('杭州', 28831), ('天津', 12260), ('上海', 1513), ('郑州', 1120)]

Result of requirement 2: ['电脑', '家电', '食品', '平板电脑', '手机', '家具', '书籍', '服饰']

Categories on sale in Beijing: ['家电', '电脑', '食品', '平板电脑', '手机', '家具', '书籍', '服饰']

Data output

Output as a Python object

Data input:

  • sc.parallelize
  • sc.textFile

Data calculation:

  • rdd.map
  • rdd.flatMap
  • rdd.reduceByKey
  • ...

collect operator

Function: collect the data from every partition of the RDD into the Driver, forming a list.

Usage:

rdd.collect()

Returns a list.

reduce operator

Function: aggregate the RDD's data according to the logic you pass in.

The return value is a plain value of the same type returned by the aggregation function (not an RDD).
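
A minimal usage sketch, assuming sc is the SparkContext built as in the earlier examples:

# reduce folds the RDD's elements pairwise with the given function and returns a plain value
total = sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda a, b: a + b)
print(total)  # 15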

take operator

Function: take the first N elements of the RDD, combine them into a list, and return that list.

Usage:

sc.parallelize([1, 2, 65, 5, 8, 841, 2, 48, 12, 21, 48]).take(6)


Result:
[1, 2, 65, 5, 8, 841]

count operator

Function: count how many elements the RDD has; the return value is an integer.
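
A minimal usage sketch, again assuming an existing sc:

# count returns the number of elements in the RDD as a plain integer
num = sc.parallelize([3, 1, 4, 1, 5]).count()
print(num)  # 5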

Summary:

1. The Spark programming flow:

  • Load data into an RDD (data input)
  • Compute on the RDD (data calculation)
  • Convert the RDD back into Python objects (data output)

2. Data output methods

  • collect: convert the RDD's contents into a list
  • reduce: aggregate the RDD's contents with custom logic
  • take: take the first N elements of the RDD as a list
  • count: count the number of elements in the RDD

Many more output methods are available; only these four are briefly introduced here.

Output to file

saveAsTextFile operator

Function: write the RDD's data to a text file

Supports writing to the local file system, HDFS, and other file systems

Code:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd.saveAsTextFile("D:/output")

Precautions

Calling an operator that saves files requires Hadoop dependencies to be configured (see the sketch after this list):

  • Download the Hadoop installation package and unzip it anywhere on your computer
  • In your Python code, configure it with the os module: os.environ['HADOOP_HOME'] = 'path to the unzipped Hadoop folder'
  • Download winutils.exe and put it in the bin directory of the unzipped Hadoop folder
  • Download hadoop.dll and put it in the C:/Windows/System32 folder
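
Putting these pieces together, a minimal sketch (the interpreter path and HADOOP_HOME below are placeholders for wherever you installed Python and unzipped Hadoop):

"""
Minimal sketch: write an RDD to text files (paths are placeholders).
"""
import os
from pyspark import SparkConf, SparkContext

os.environ['PYSPARK_PYTHON'] = "D:/Python/Python311/python.exe"  # your Python interpreter
os.environ['HADOOP_HOME'] = "D:/dev/hadoop"                      # folder whose bin/ holds winutils.exe

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
rdd.saveAsTextFile("D:/output")  # creates a folder with one part file per partition
sc.stop()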

Modify the RDD partitioning to 1

Method 1: set the global parallelism property to 1 on the SparkConf object:

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
# Set Spark's global parallelism to 1
conf.set('spark.default.parallelism', '1')

sc = SparkContext(conf=conf)

Method 2: set it when creating the RDD (pass the numSlices parameter of parallelize as 1):

rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=1)  # set the number of partitions to 1

rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6], 1)  # set the number of partitions to 1

Summary:

1. How to output an RDD to a file

  • rdd.saveAsTextFile(path)
  • The output is a folder
  • One result file is produced for each partition

2. How to modify the RDD partitioning

  • On the SparkConf object: conf.set("spark.default.parallelism", "1")
  • When creating an RDD: pass numSlices=1 to sc.parallelize

Origin blog.csdn.net/qq1226546902/article/details/132038658