PySpark in Practice: map/flatMap Usage Examples

1. map() usage examples

PySpark map() Transformation - Spark By {Examples}

        1.1 Comparing the behavior of map() and foreach()

                 PySpark foreach() Usage with Examples - Spark By {Examples}

        1.2 Comparing the behavior of map() and apply()

                 PySpark apply Function to Column - Spark By {Examples}

        1.3 Comparing the behavior of map() and transform()

                PySpark transform() Function with Example - Spark By {Examples}

2. flatMap() usage examples

PySpark flatMap() Transformation - Spark By {Examples}

 1. map() usage examples

     Syntax:

map(f, preservesPartitioning=False)

map() applies f to every element and returns a new RDD of the results; preservesPartitioning only matters for pair RDDs and tells Spark whether f keeps the existing partitioner.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# 1. map() on an RDD
data = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"]

rdd=spark.sparkContext.parallelize(data)

rdd2=rdd.map(lambda x: (x,1))
for element in rdd2.collect():
    print(element)

# 2. map() on a DataFrame
data = [('James','Smith','M',30),
  ('Anna','Rose','F',41),
  ('Robert','Williams','M',62), 
]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|    30|
|     Anna|    Rose|     F|    41|
|   Robert|Williams|     M|    62|
+---------+--------+------+------+


# A PySpark DataFrame has no map() method, so convert it to an RDD first via df.rdd
# 2.1 Referring to columns by index
rdd2=df.rdd.map(lambda x:
    (x[0]+","+x[1],x[2],x[3]*2)
    )
df2=rdd2.toDF(["name","gender","new_salary"])
df2.show()
+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|        60|
|      Anna,Rose|     F|        82|
|Robert,Williams|     M|       124|
+---------------+------+----------+

# 2.2 Referring to columns by name
rdd2=df.rdd.map(lambda x: 
    (x["firstname"]+","+x["lastname"],x["gender"],x["salary"]*2)
    ) 


# Alternatively, refer to columns as Row attributes
rdd2=df.rdd.map(lambda x: 
    (x.firstname+","+x.lastname,x.gender,x.salary*2)
    ) 

# 2.3 By calling a named function
def func1(x):
    firstName=x.firstname
    lastName=x.lastname
    name=firstName+","+lastName
    gender=x.gender.lower()
    salary=x.salary*2
    return (name,gender,salary)

rdd2=df.rdd.map(func1)  # passing func1 directly; the lambda wrapper is unnecessary

        1.1 Comparing the behavior of map() and foreach()
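
map() is a transformation: it lazily returns a new RDD, and nothing is computed until an action such as collect() runs. foreach() is an action: it applies a function to every element purely for its side effects (for example writing to an external store), returns nothing, and the function runs on the executors, so print() output goes to the executor logs rather than to the driver (except in local mode). A minimal sketch of the contrast, reusing the rdd built in the example above:

# map(): transformation, lazy, produces a new RDD
rdd2 = rdd.map(lambda x: (x, 1))
print(rdd2.collect())

# foreach(): action, returns None, runs on the executors for side effects only
def print_element(x):
    print(x)   # printed on the executors, not collected back to the driver

rdd.foreach(print_element)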

             

        1.2 Comparing the behavior of map() and apply()
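
A PySpark DataFrame has no pandas-style apply(); "applying a function to a column" is normally done inside the DataFrame API with withColumn() plus built-in functions, or with a udf for arbitrary Python logic (pandas_udf is the vectorized variant). Unlike map(), this keeps the schema and lets the Catalyst optimizer see the expression. A rough sketch that reproduces the salary-doubling example column-wise, assuming the df created above:

from pyspark.sql.functions import col, concat_ws, udf
from pyspark.sql.types import IntegerType

# Built-in column expressions: no conversion to an RDD needed
df3 = df.withColumn("name", concat_ws(",", col("firstname"), col("lastname"))) \
        .withColumn("new_salary", col("salary") * 2)

# A UDF applies an arbitrary Python function to a column (at a serialization cost)
double_udf = udf(lambda s: s * 2, IntegerType())
df4 = df.withColumn("new_salary", double_udf(col("salary")))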

        1.3 Comparing the behavior of map() and transform()
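
DataFrame.transform(func) (Spark 3.0+) passes the whole DataFrame to func and expects a DataFrame back, which makes custom transformations chainable; map() instead rewrites individual rows and loses the schema until toDF() is called. (pyspark.sql.functions.transform() is a separate function that applies a lambda to each element of an array column.) A minimal sketch assuming the df created above and Spark 3.0+:

from pyspark.sql.functions import col, concat_ws

def add_name(in_df):
    return in_df.withColumn("name", concat_ws(",", col("firstname"), col("lastname")))

def double_salary(in_df):
    return in_df.withColumn("new_salary", col("salary") * 2)

# Each step receives the full DataFrame and returns a new one
df5 = df.transform(add_name).transform(double_salary)
df5.show()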

      

2. flatMap() usage examples

 Syntax:

flatMap(f, preservesPartitioning=False)

flatMap() applies f to every element and then flattens the results, so f should return an iterable (such as a list) and each of its items becomes a separate element of the new RDD.
# 1. flatMap() on an RDD
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = ["Project Gutenberg’s",
        "Alice’s Adventures in Wonderland",
        "Project Gutenberg’s",
        "Adventures in Wonderland",
        "Project Gutenberg’s"]
rdd=spark.sparkContext.parallelize(data)

# flatMap: split each line into words and flatten the result
rdd2=rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)
# 2. flatMap() on a DataFrame (via its underlying RDD)
# df_tmp is assumed to be an existing DataFrame with columns user_id, cate_cd,
# shop_id, sku_id and a comma-separated string column window_type_str
cols_tmp = ["user_id", "cate_cd", "shop_id", "sku_id", "window_type"]
df = df_tmp.rdd.flatMap(
    lambda x: [(x.user_id, x.cate_cd, x.shop_id, x.sku_id, window_type)
               for window_type in x.window_type_str.split(",")]
).toDF(cols_tmp)

# Two-variable list comprehension: this emits one row per (id, name) combination,
# i.e. the cross product of the two split lists; the source DataFrame is assumed to
# have project, id_list, name_list, start_time and end_time columns
cols_name = ["project", "id", "name", "start_time", "end_time"]
df.rdd.flatMap(
    lambda x: [(x.project, id, name, x.start_time, x.end_time)
               for id in x.id_list.split(",")
               for name in x.name_list.split(",")]
).toDF(cols_name)
# 3. Using explode() from pyspark.sql.functions to achieve the same effect
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()

arrayData = [
        ('James',['Java','Scala'],{'hair':'black','eye':'brown'}),
        ('Michael',['Spark','Java',None],{'hair':'brown','eye':None}),
        ('Robert',['CSharp',''],{'hair':'red','eye':''}),
        ('Washington',None,None),
        ('Jefferson',['1','2'],{})]
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages','properties'])

from pyspark.sql.functions import explode
df2 = df.select(df.name,explode(df.knownLanguages))
df2.printSchema()
df2.show()
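
Note that explode() drops rows whose array is null (Washington does not appear in df2 above); when those rows need to be kept, explode_outer() emits a null element instead. A small sketch of the variant:

from pyspark.sql.functions import explode_outer
df3 = df.select(df.name, explode_outer(df.knownLanguages))
df3.show()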

Reposted from blog.csdn.net/eylier/article/details/128718457