Python Spark (PySpark)

Spark computing engine

# Imports
from pyspark import SparkConf, SparkContext

# Set environment variables
import os
# Tell PySpark which Python interpreter to use
os.environ['PYSPARK_PYTHON'] = r'D:\dev\python 3.11.4'
# Create a SparkConf object
#   setMaster() sets the run mode (can also point at a distributed cluster)
#   setAppName() sets the application name
conf = SparkConf().setMaster("local[*]").setAppName("test_spark_app")

# Create the SparkContext from the SparkConf; it is the entry point
# (execution-environment entry object) of the program
sc = SparkContext(conf=conf)
# Print the PySpark version
print(sc.version)
# RDD objects: SparkContext's parallelize() method converts a Python data
# container (list, tuple, set, dict, str) into an RDD object
rdd = sc.parallelize(data_container)
# Read a file and turn it into an RDD object
rdd = sc.textFile(file_path)
# Output the RDD
# print(rdd) will not show the data: print only shows Python objects;
# rdd.collect() converts the RDD back into a Python object first
print(rdd.collect())

# Stop the SparkContext (ends the PySpark program)
sc.stop()


Spark data processing


map and flatMap

flatMap has the same effect as map, but flatMap additionally un-nests (flattens) the result.

reduceByKey

reduceByKey works on an RDD of two-element tuples, also known as KV pairs, e.g. [('a', 1), ('b', 2)]. It groups the elements by key, then combines the values within each group two at a time with the supplied function.


Output Data


Origin blog.csdn.net/u013400314/article/details/131259389