Pyspark study notes summary

pyspark official documentation: https://spark.apache.org/docs/latest/api/python/index.html
pyspark case tutorial: https://sparkbyexamples.com/pyspark-tutorial/

0. Preface

This article records some commonly used syntax for PySpark and for processing large-scale data with Spark SQL that I have been using recently. I always thought pandas was a powerful tool for data analysis and data mining, but once the data grows beyond what a single machine can comfortably handle, ordinary pandas is almost useless, so it became necessary to explore PySpark again. I had touched it in school, but mostly only environment setup; data processing and data analysis were not covered in any depth. So now that I am working, I plan to dig into big-data development and slowly fill in this hole, from both daily usage and the underlying principles.

During this period I happened to be using PySpark's Spark DataFrame for some data analysis and processing work, so based on that experience I have organized some commonly used syntax, which makes it convenient to review and practice later; operations on Spark DataFrames that come up later will be integrated here as well. As with the earlier git notes, this time I again adopt a "requirement plus solution" style, so that what is learned can really be applied.

main content:

  • A brief introduction to Pyspark
  • RDDs
  • DataFrame

Ok, let’s go!

1. Introduction to Pyspark

1.1 Background information

Pyspark is a library (Spark's Python API) that can run python programs on Spark. With it, python programs can be run in parallel on distributed clusters.

Apache Spark is an analytical processing engine for powerful distributed data processing and machine learning applications at scale.

Spark was originally written in Scala. Later, as Python became widely used in industry, a Python API was released based on Py4J. Py4J is a Java library integrated into PySpark that lets Python programs interact with JVM objects, so running PySpark also requires Java and Spark. This is why, when running a PySpark program, you may need to specify the Java home, the Spark home, and the path where the py4j package lives. For example, a configuration from a PySpark program I ran before:

import os
import sys   # sys.path is the list of paths Python searches for modules

os.environ['SPARK_HOME'] = '/usr/local/spark'
sys.path.append('/usr/local/spark/python')
sys.path.append('/home/hduser/anaconda3/envs/bigdata_env/bin/python3.5')
sys.path.append('/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')

Advantages: the core one is speed. After all, it can use a cluster: parallelism, in-memory computation, fault tolerance, and lazy evaluation.

  • Distributed data processing, fast
  • Can easily process data in cloud storage file systems, such as HDFS and AWS S3
  • Real-time data can also be processed with Streaming and Kafka
  • Using Pyspark Streaming, it is also possible to stream files from the file system and to stream from sockets
  • Pyspark also has machine learning and graph libraries

pyspark architecture:
master-slave architecture (driver-worker): when a Spark program runs, the Spark driver creates the context (the program entry point), the concrete operations (transformations and actions) are executed on worker nodes, and resources are managed uniformly by the cluster manager.

1.2 Basic components

I already organized notes on installing PySpark back in school, so I won't repeat that here. Instead, here is a list of PySpark's components and packages, to see from a macro perspective what PySpark contains.

1.2.1 pyspark RDD

The RDD is PySpark's basic data structure: a fault-tolerant, immutable collection of distributed objects that cannot be changed once created. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
To create an RDD, you first need to create a SparkSession (the entry point of a PySpark application):

# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession 
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate() 

# Another way to write it: by passing a conf you can set Spark's run mode (local or cluster).
# local[*] and the builder call above are both local mode; in that case the defaults under
# spark/conf (spark-defaults.conf) are read, so configuration such as executor memory
# can also be added there.
from pyspark import SparkConf

conf = SparkConf().setAppName('test').setMaster("local[*]")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

SparkSession creates a SparkContext variable internally. We can create multiple SparkSession objects, but each JVM can have only one SparkContext. If you want to create another SparkContext, you must first stop the existing one with stop(), otherwise you get an error that a SparkContext already exists; this comes up easily when running code in a Jupyter notebook.
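For example, a minimal sketch of restarting the session in a notebook (assuming the `spark` object created above; the app name here is made up):

spark.stop()   # stops the SparkSession together with its underlying SparkContext

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("restart-demo") \
    .getOrCreate()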

An RDD can be created from a list, from specified external data sources, or from a Spark DataFrame.

# Create RDD from parallelize base a list
dataList = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
rdd=spark.sparkContext.parallelize(dataList)

# Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/test.txt")

rdd3 = spark_df.rdd

Operations on RDDs:

  • Transformations : lazy operations. Running a transformation just returns another RDD; nothing is actually computed, the operation is only recorded. Examples: flatMap(), map(), reduceByKey(), filter(), sortByKey(), etc.
  • Actions : this is when computation actually starts and the result is returned to the driver. Examples: count(), collect(), first(), max(), reduce(), etc.

This feels to me like the static-graph mechanism in TensorFlow 1.x: transformations are similar to building the graph, and actions are similar to opening a Session and executing it.

1.2.2 Pyspark DataFrame

The DataFrame here is a distributed data collection, similar to a pandas data table but with richer functionality. It can be created from structured data, Hive tables, external databases, or existing RDDs.

A PySpark DataFrame is very similar to a pandas DataFrame; the difference is that a PySpark DataFrame is distributed across the cluster (meaning the data in the DataFrame is stored on different machines) and every operation in PySpark runs in parallel on all machines, whereas a pandas DataFrame is stored and manipulated on a single machine.

Regarding the creation method, we will organize it in detail in the DataFrame section later. DataFrame operations are also divided into Transformations and Actions.

1.2.3 Pyspark SQL

This is very common: you can directly write SQL statements to process DataFrame data. But before operating on a DataFrame this way, you need to register it as a temporary view with createTempView().

df.createTempView("PERSON_DATA")
df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()

groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()

# When you are done, be sure to drop the temporary view; otherwise a duplicate view name will raise an error
spark.catalog.dropTempView("PERSON_DATA")

1.2.4 Other components

The others are not used much, just a brief introduction.

  1. Pyspark Streaming

    This is a scalable, high-throughput, fault-tolerant stream-processing system that supports both batch and streaming workloads, used for processing real-time data.
    Interaction:

    # Read streaming data from a TCP socket
    df = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9090")
          .load())

    # Read streaming data from Kafka
    df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "192.168.1.100:9092")
            .option("subscribe", "json_topic")
            .option("startingOffsets", "earliest")   # read from the beginning
            .load())

    # Write data to Kafka
    (df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
       .writeStream
       .format("kafka")
       .outputMode("append")
       .option("kafka.bootstrap.servers", "192.168.1.100:9092")
       .option("topic", "json_data_topic")
       .start()
       .awaitTermination())
    
  2. Pyspark MLlib
    This is a machine-learning library that can run some simple machine-learning models on Spark. I haven't used it much yet, so I'll come back and fill this in later.

  3. Pyspark GraphFrames graph computing library

1.3 SparkSession and SparkContext

1.3.1 SparkSession

SparkSession is the entry point of Pyspark, and it is a step that must be taken before using RDD, DataFrame or Dataset.

from pyspark.sql import SparkSession

# master: for local mode, the x in local[x] must be greater than 0; it sets how many
#         partitions RDDs/DataFrames get by default (ideally the number of CPU cores)
# appName: the application name
# getOrCreate: return the existing SparkSession if there is one, otherwise create a new one
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

# Usage of config(): Spark settings can be set here, or a config file can be read
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .config("spark.some.config.option", "config-value") \
      .getOrCreate()

# Set Config
spark.conf.set("spark.executor.memory", "5g")
# Get a Spark Config
partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(partitions)

# Enabling Hive support in Spark
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .config("spark.sql.warehouse.dir", "<path>/spark-warehouse") \
      .enableHiveSupport() \
      .getOrCreate()

If you want to use cluster mode, the master needs to point to the address of a cluster on which the Spark environment has already been set up.

import os

# Configure the Python interpreter paths used by the Spark driver and by PySpark at runtime
import sys   # sys.path is the list of paths Python searches for modules
os.environ['JAVA_HOME'] = '/opt/bigdata/java/jdk1.8'
os.environ['SPARK_HOME'] = '/opt/bigdata/spark/spark2.2'
os.environ['PYSPARK_PYTHON'] = '/opt/bigdata/anaconda3/envs/bigdata_env/bin/python3.7'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/bigdata/anaconda3/envs/bigdata_env/bin/python3.7'
sys.path.append('/opt/bigdata/spark/spark2.2/python')
sys.path.append('/opt/bigdata/spark/spark2.2/python/lib/py4j-0.10.4-src.zip')
sys.path.append('/opt/bigdata/anaconda3/envs/bigdata_env/bin/python3.7')


# Spark configuration
from pyspark import SparkConf
from pyspark.sql import SparkSession

SPARK_APP_NAME = "ALSRecommend"
SPARK_URL = "spark://192.168.56.101:7077"    # the Spark master URL -- make sure it is correct

conf = SparkConf()    # create the Spark config object
config = (
    ("spark.app.name", SPARK_APP_NAME),    # name of the Spark app; if not provided, a random name is generated
    ("spark.executor.memory", "6g"),    # memory the app uses when launched, default 1g
    ("spark.master", SPARK_URL),    # address of the Spark master
    ("spark.executor.cores", "4"),    # number of CPU cores each Spark executor uses
    # The following three settings control the number of executors
#     ("spark.dynamicAllocation.enabled", True),
#     ("spark.dynamicAllocation.initialExecutors", 1),    # start with 1 executor
#     ("spark.shuffle.service.enabled", True)
#     ('spark.sql.pivotMaxValues', '99999'),  # needed when pivoting a DF with many distinct values; default is 10000
)

# See more detailed configuration and explanations: https://spark.apache.org/docs/latest/configuration.html
conf.setAll(config)

# Create the Spark session from the config object
spark = SparkSession.builder.config(conf=conf).getOrCreate()

1.3.2 SparkContext

SparkContext is the functional entry point of PySpark for communicating with the cluster, creating RDDs, accumulators, and broadcast variables. Only one can be created per JVM.

The Spark driver program creates and uses the SparkContext to connect to the cluster manager, submit PySpark jobs, and know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart of a PySpark application.

# Create SparkSession from builder
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
print(spark.sparkContext)
print("Spark App Name : "+ spark.sparkContext.appName)

# Outputs
#<SparkContext master=local[1] appName=SparkByExamples.com>
#Spark App Name : SparkByExamples.com


# SparkContext stop() method
spark.sparkContext.stop()

# You can also create a SparkContext directly

# Create Spark Context
from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.setMaster("local").setAppName("Spark Example App")
sc = SparkContext.getOrCreate(conf)
print(sc.appName)

With this set up, the Spark operations that follow can proceed normally.

2. RDDs

RDD (Resilient Distributed Dataset) is the basic building block of PySpark; it is fault-tolerant and immutable. Each record in an RDD is divided into logical partitions, so operations can be performed on different cluster nodes.

If PySpark's DataFrame is the counterpart of the pandas DataFrame, then the RDD is very much like the Python list; the difference, again, is that an RDD can be distributed across multiple machines, while a list lives in a single process.
The RDD abstracts away how the data is partitioned and run across machines: for the user, distributing the data through partitioning, computing in parallel, and collecting the results is a black box — you just use it.

Features:

  • In-memory processing — this is the key difference from MapReduce
  • Immutable once created: when a transformation is applied to an RDD, PySpark creates a new RDD
  • Fault tolerance: failed computations are automatically recovered and retried
  • Lazy evaluation: transformations only record the operation without actually executing it; execution starts only when an action is encountered
  • When an RDD is created it is partitioned, by default into as many partitions as there are available cores

PySpark RDDs are not well suited to applications that perform fine-grained updates to shared state, such as the storage layer of a web application; for those, a system such as a database that performs traditional update logging and data checkpointing is more efficient. The purpose of RDDs is to provide an efficient programming model for batch analytics, leaving such asynchronous applications to other systems.

2.1 RDD creation

Creation method: existing collection or external storage system (HDFS)

# Create an RDD with parallelize from a list
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd=spark.sparkContext.parallelize(data, partitionNums)  # the second argument optionally sets the number of partitions

#Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

# When an RDD is created with the two methods above, the data is automatically divided into partitions; the line below shows the partition count
print("initial partition count:"+str(rdd.getNumPartitions()))
#Outputs: initial partition count:2

2.2 RDD partition

RDDs can be repartitioned as we specify; PySpark provides two methods, repartition and coalesce:

  1. repartition: repartitions the data into the specified number of partitions, shuffling the full data set
  2. coalesce: minimizes the shuffle. For example, with 4 partitions, coalesce(2) simply moves the data of two partitions onto the other two, whereas repartition(2) would shuffle all the data and redistribute it into two partitions.

rdd1 = spark.sparkContext.parallelize(range(0, 20), 6)   # an example RDD with 6 partitions
rdd1.saveAsTextFile("c://tmp/partition2")

rdd2 = rdd1.repartition(4)
print("Repartition size : "+str(rdd2.getNumPartitions()))
rdd2.saveAsTextFile("c://tmp/re-partition2")

rdd3 = rdd1.coalesce(4)
print("Repartition size : "+str(rdd3.getNumPartitions()))
rdd3.saveAsTextFile("c:/tmp/coalesce2")

These two repartitioning operations also exist for DataFrames, and are used in basically the same way:

df2 = df.repartition(6)
df3 = df.coalesce(2)

One point to note here: calling functions such as groupBy(), union(), and join() on a DataFrame causes the data to be re-shuffled and re-partitioned. By default it is re-partitioned into 200 partitions; this can be adjusted through the spark.sql.shuffle.partitions configuration.
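A quick way to check and adjust this setting (a minimal sketch; spark.sql.shuffle.partitions is the standard config key, and df can be any DataFrame from the examples above):

print(spark.conf.get("spark.sql.shuffle.partitions"))   # 200 by default
spark.conf.set("spark.sql.shuffle.partitions", "64")    # e.g. use fewer partitions for small data

grouped = df.groupBy(df.columns[0]).count()             # groupBy forces a shuffle
print(grouped.rdd.getNumPartitions())                   # usually reflects the setting above (adaptive execution may coalesce it further)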

2.3 Operation of RDD

As mentioned above, RDD mainly includes two operations:

  • Transformations : lazy operations. Running a transformation just returns another RDD; nothing is computed yet, the operation is only recorded. Examples: flatMap(), map(), reduceByKey(), filter(), sortByKey(), etc.
  • Actions : this is when computation actually starts and the result is returned to the driver. Examples: count(), collect(), first(), max(), reduce(), etc.

The classic case here is wordcount, which lets us analyze what each of the above operations does at every step. Once the principle is clear, I will also organize a requirement that comes up often in real scenarios and analyze the important details.

Ok, let's head to a Jupyter Notebook that some generous company hosts for free, and quickly play with the operations above.

2.3.1 wordcount

Here we first use the simplest wordcount task to walk through how the operations above work. I don't have a text file handy, so let's just write a few sentences as the example.

from pyspark.sql import SparkSession
# Create the Spark session and get the Spark context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Create an RDD from a list of strings; behind the scenes the three sentences are
# partitioned and placed on different machines
words = [
    "I love me", 
    "This is a start for me about pyspark",
    "pyspark is interesting and helpful for me"
]
rdd = sc.parallelize(words)

# wordcount step 1: use flatMap to operate on each sentence.
# Each sentence is split on spaces into several words, i.e. one input record
# becomes multiple output records -- that is what flatMap is for
rdd1 = rdd.flatMap(lambda x: x.split(" "))

# wordcount step 2: map initializes a counter for each word; map takes one record
# and returns one record
rdd2 = rdd1.map(lambda x: (x,1))

# wordcount step 3: reduceByKey combines the values of each key with the given
# function; here the values of identical keys are added up to count occurrences
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)

# wordcount step 4: sort the words by count in ascending order
rdd4 = rdd3.map(lambda x: (x[1], x[0])).sortByKey()

# keep only the words that occur more than once
rdd5 = rdd4.filter(lambda x : x[0] > 1)

After this chain of transformations, the word-frequency counting is set up. The next step is to use an action to actually execute it and output the result.

print(rdd5.collect())

Calling collect() triggers the execution and the word-frequency statistics come out. Other actions can also be applied to rdd5, for example:

rdd5.count()  # 4
rdd5.first()   # return the first element
rdd5.take(3)  # return the first 3 elements
rdd5.reduce(lambda x, y: (x[0]+y[0],))  # total number of occurrences of these words

2.3.2 A real scenario

So, for these RDD transformation operations, what do the requirements look like in real scenarios, and how are they handled?

Scenario: when building a data warehouse, I have tens of thousands of parquet files, and each parquet file contains tens of thousands of records. How should I read this data, do some analysis and processing on it, and then write it into a Hive table?

This is actually very common with big data, and the logic is relatively simple: using what was covered above, it can be done with a few lines of code. Let's analyze it first.

Analysis: there are tens of thousands of parquet files. We first put the file paths into a path_list; then, for each path, we read the parquet data at that path, which returns a list of tens of thousands of records. Different paths can be assigned to different machines to do this, and the resulting records are then aggregated and merged, giving hundreds of millions of records. Each record is preprocessed, and the whole thing is converted into a Spark DataFrame for distributed mining, analysis, storage, and so on.

So this is actually a very important and very general paradigm for reading big data; translated into code it looks like this:

# A Row object is needed here; it corresponds to one DataFrame record, and the columns can be defined in advance
import json
from pyspark.sql import Row
import pyarrow.parquet as pq

# Define the column names of the DataFrame
data_row = Row('name', 'age', 'sex', 'address', 'phone')

# Given the path of a parquet file, return all the records in it
def read_parquet_data(parquet_path):
    # convert the parquet file into a pandas dataframe
    df = pq.read_table(parquet_path).to_pandas()
    data = json.loads(df.to_json(orient="records"))
    res = []
    for i in data:
        if not i:
            continue
        res.append(data_row(i.get("name"), i.get("age"), i.get("sex"), i.get("address"), i.get('phone')))
    return res

# Process each record; the argument is a Row object
def process(item):
    new_item = Row('name', 'age', 'sex', 'address', 'phone', 'class')
    # the original values can be read via item.name, item.age, item.sex, etc.; do whatever processing is needed here
    if item.phone.startswith('178'):
        cls = 1
    else:
        cls = 2
    return new_item(item.name, item.age, item.sex, item.address, item.phone, cls)

# The line below is the core; the map here could also be replaced with mapPartitions
raw_data_rdd = sc.parallelize(parquet_path_list).flatMap(read_parquet_data).map(process)
df = spark.createDataFrame(raw_data_rdd)  # if the RDD's column structure is too complex to infer, specify the schema manually

# From here on, data processing and analysis can be done on the dataframe.

Note: process receives a Row object, and Row objects are immutable — you cannot change a value directly via item.&lt;field&gt;. If you want to modify field values, first convert the Row into a dictionary; this is actually very common.

def process_row(row):
    # convert the Row into a dict
    d = row.asDict()
    # process the individual fields as needed, e.g.
    d['a'] = ...   # placeholder for the actual processing
    d['b'] = ...
    return Row(**d)

spark.createDataFrame(df.rdd.map(process_row))

2.3.3 Dissecting a small detail

Among the RDD operations, map, flatMap and mapPartitions are used very often. What is the difference between the three? A brief summary (a small sketch follows the list):

  1. map: operates on each record of the RDD and returns one record, i.e. it receives a row object (or value), processes it, and returns one row object (value)
  2. mapPartitions: same idea as map, but it operates on a partition iterator. With an ordinary map, if a partition holds 10,000 records, the function is called 10,000 times; with mapPartitions, a task calls the function only once and the function receives all of the partition's data at one time, so the performance is relatively high. The disadvantage is that it can blow up memory when the partition holds a lot of data.
  3. flatMap: the difference from the two above is the "flat": the result is flattened. Generally speaking, it operates on each record of the RDD but returns multiple records, i.e. it receives one row object and returns a list of row objects.
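A toy sketch of the three (assuming the sc context created earlier; the data and numbers are made up):

rdd = sc.parallelize([1, 2, 3, 4], 2)

# map: one record in, one record out
print(rdd.map(lambda x: x * 10).collect())            # [10, 20, 30, 40]

# flatMap: one record in, zero or more records out, flattened into a single RDD
print(rdd.flatMap(lambda x: [x, x * 10]).collect())   # [1, 10, 2, 20, 3, 30, 4, 40]

# mapPartitions: the function is called once per partition and receives an iterator
def add_one(partition):
    return (x + 1 for x in partition)

print(rdd.mapPartitions(add_one).collect())           # [2, 3, 4, 5]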

2.4 Broadcast variables

Finally, let's sort out broadcast variables in RDDs. A typical application scenario: processing every record based on some mapping, e.g. mapping one column's value to another value. Broadcasting the dictionary to every executor speeds up execution and reduces communication.

Scenario: there are hundreds of millions of records, and I want to replace the country abbreviation in one column with the full name.

This directly takes the official example:

# First define a dictionary of conversion rules
states = {"NY":"New York", "CA":"California", "FL":"Florida"}

# Declare it as a broadcast variable.
# It also works without this line, but then the executors would have to fetch the dict from the driver
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

rdd = spark.sparkContext.parallelize(data)

def state_convert(code):
    return broadcastStates.value[code]

result = rdd.map(lambda x: (x[0],x[1],x[2],state_convert(x[3]))).collect()

3. DataFrame

3.1 DataFrame Creation

The first is to create a DataFrame. Here I roughly sort out three common methods.

3.1.1 RDD creation

When creating from an RDD, there are two functions: toDF() and createDataFrame().

# Create from an RDD
df = rdd.toDF()  # the schema is inferred; column names come out as _1, _2, _3, ...

# Specify the schema (column names)
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()

# createDataFrame
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

Specifying the schema at creation time is very commonly used:

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True) \
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

The schema specifies the DataFrame's column names, their types, whether they may be null, and so on. Because Spark DataFrames are stored differently from pandas DataFrames, specifying the schema tells Spark how to store these columns. For simple data types Spark can infer the schema on its own, but for complex ones you must specify it.

3.1.2 Reading various data sources

This is also very commonly used — after all, with big data you can't just write out an RDD list by hand. The reading paradigm above already showed a sample of reading parquet data, obtaining an RDD, and converting it into a DataFrame; below is how to read DataFrames directly from common file formats.

# csv, text, json, xml, and reading a table with sql
df2 = spark.read.text("/src/resources/file.txt")
df2 = spark.read.json("/src/resources/file.json")
df2 = spark.sql("select * from table where xxx is xxx")

df2 = spark.read.csv("/src/resources/file.csv")
# another way to read csv
df2 = spark.read.format("csv").load("src/resources/file.csv")
# some options can also be added
df2 = spark.read.options(header=True, delimiter=',', inferSchema=True) \
     .csv("/tmp/resources/zipcodes.csv")
# there are many more options: quotes, nullValues, dateFormat, etc.
# the following is also very commonly used
schema = StructType() \
      .add("RecordNumber",IntegerType(),True) \
      .add("Zipcode",IntegerType(),True) \
      .add("ZipCodeType",StringType(),True) \
      .add("City",StringType(),True) \
      .add("State",StringType(),True) \
      .add("Notes",StringType(),True)
      
df_with_schema = spark.read.format("csv") \
      .option("header", True) \
      .schema(schema) \
      .load("/tmp/resources/zipcodes.csv")

3.1.3 Conversions

The mutual conversions here include RDD and Spark DataFrame, and pandas DataFrame and Spark DataFrame

# rdd <-> spark df
df = rdd.toDF(schema=xxx)
df = spark.createDataFrame(rdd, schema=xxx)
rdd = df.rdd

# pandas df <-> spark df
pandf_val = pandf.values.tolist()
pandf_col = list(pandf.columns)
spark_df = spark.createDataFrame(pandf_val, schema=pandf_col)

pandf = spark_df.toPandas()   # note: the data must not be too large, otherwise the driver runs out of memory

3.1.4 StructType & StructField

The StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested structs, arrays, and maps.
StructField defines the type of a single DataFrame column, and StructType is a collection of StructFields. A commonly used nested structure:

structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])

df2 = spark.createDataFrame(data=structureData,schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

Note: ArrayType and MapType also require specifying the type of each element:

arrayStructureSchema = StructType([
    StructField('name', StructType([
       StructField('firstname', StringType(), True),
       StructField('middlename', StringType(), True),
       StructField('lastname', StringType(), True)
       ])),
       StructField('hobbies', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])
# more complex: fields like these go inside a StructType (assumes `from pyspark.sql import types`)
types.StructField(
    "properties",
    types.ArrayType(
        types.StructType([
            types.StructField("property_name", types.StringType(), False),
            types.StructField("property_value", types.StringType(), False),
        ])),
    False,
),
types.StructField(
    "info",
    types.StructType([
        types.StructField("task", types.StringType(), False),
        types.StructField("operator", types.StringType(), False),
        types.StructField("date", types.StringType(), False),
    ]),
    False,
),

Therefore, if you want to create a DataFrame with a complex structure, these composite nested types can sometimes be a headache.

3.2 DataFrame operation

This is the highlight. The operation of DataFrame is the essence of big data processing. Due to the large number of various operations, we first sort out the commonly used ones, and then slowly learn and supplement them.

3.2.1 Basic search

Macroscopically speaking, DataFrame operations are also divided into two kinds, transformations and actions.
When we get hold of a DataFrame, just as with pandas, we first want to see which columns it has, how much data there is, and what type each column is, so the following functions are used all the time:

# Show the data structure: each column's data type, etc.
df.printSchema()

# Show the first n rows
df.show(n, truncate=False)
# Count the total number of rows
df.count()
# Column names
df.columns

# Select some columns with select: a single column, multiple columns, columns by regex, etc.
df.select('SepalLength','SepalWidth').show()
from pyspark.sql.functions import col
df.select(col("firstname"),col("lastname")).show()
df.select(df.colRegex("`^.*name*`")).show()   # regular expression on column names
df.select("*").show()  # all columns
# nested (struct) columns
df2.select("name.firstname","name.lastname").show(truncate=False)

# Summary statistics with describe
df.describe().show()
# describe a single column
df.describe('cls').show() 

# distinct
df.select('cls').distinct().count()     # similar to pandas nunique
df.select('cls').distinct().collect()   # similar to pandas unique

With these, you can pretty much understand the DataFrame's scale, size, and columns. Then, when doing data processing, the first trick is to see whether some conditions can filter the data down to just the rows and columns actually used, reducing the size of the data — in most scenarios not all of the data is really needed.
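For example, a minimal sketch of narrowing the data early (the column names firstname and salary come from the schema example earlier; the threshold is made up):

import pyspark.sql.functions as F

slim_df = (df.select("firstname", "salary")      # keep only the columns you need
             .filter(F.col("salary") > 3000))    # and only the rows you care about
slim_df.show()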

3.2.2 Column operations

3.2.2.1 Simple operations based on column names

Here are the commonly used functions for column operations

# Alias a column with alias
df.select(df.fname.alias("first_name"), df.lname.alias("last_name")).show()

# Change a column's type with cast
df.select(df.fname,df.id.cast("int")).printSchema()

# Sorting: asc, desc
df.sort(df.fname.asc()).show()
df.sort(df.fname.desc()).show()

# Filter rows by column values with filter
df.filter(df.id.between(100,300)).show()  # very common: filter big data down by conditions first; combine conditions with df.filter(() & () & ())
df.filter(df.fname.contains("Cruise")).show()
df.filter(df.fname.startswith("T")).show()
df.filter(df.fname.endswith("Cruise")).show()
# Check whether a column is null: isNull / isNotNull
df.filter(df.lname.isNull()).show()
df.filter(df.lname.isNotNull()).show()
# Fuzzy matching with like / rlike
df.select(df.fname,df.lname,df.id).filter(df.fname.like("%om")) 
# Substrings, membership
df.select(df.fname.substr(1,2).alias("substr")).show()
li=["100","200"]
df.select(df.fname,df.lname,df.id).filter(~df.id.isin(li)).show()
testDF.select('cls').subtract(trainDF.select('cls'))

# Drop a column
df.drop('cls').show()

3.2.2.2 withColumn manipulation

This is another big workhorse for manipulating columns and shows up very often. It is a transformation used to change column values, add new columns, and modify column types.

import pyspark.sql.functions as F   # pyspark's built-in function package

# Change a column's type with cast
df.withColumn("salary",F.col("salary").cast("Integer")).show()

# Update a column's value from existing columns; this is the simple version, complex cases need a custom function
df.withColumn("salary",F.col("salary")*100).show()

# Add new columns
df.withColumn("CopiedColumn",F.col("salary")* -1).show()
df.withColumn("Country", F.lit("USA")).show()   # lit() supplies a constant as the column's value

# Rename a column
df.withColumnRenamed("gender","sex").show(truncate=False) 

3.2.3 Row operations

Regarding row operations, the first point is that, unlike a pandas DataFrame, a Spark DataFrame cannot fetch individual rows by a positional index such as iloc, because the two store data differently: Spark does not keep the whole table intact in one place. A Spark DataFrame can only be shown with show(), or turned into a list with collect() and then indexed.
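A small sketch of row access (assuming df is the DataFrame with a firstname column from the schema example above):

rows = df.take(2)              # take(n) returns a list of Row objects
print(rows[0]["firstname"])    # fields are accessed by name, not by positional index

all_rows = df.collect()        # collect() also returns a list, but pulls the whole DataFrame to the driver
print(all_rows[0].firstname)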

3.2.3.1 Deduplication, sorting, and sampling

For the rows of a DataFrame, the common operations are deduplication, sorting for display, sampling, and so on:

# The most direct deduplication
distinctDF = df.distinct()
df2 = df.dropDuplicates()

# Deduplicate on specified columns
dropDisDF = df.dropDuplicates(["department","salary"])

# Sort for display with sort / orderBy
df.sort(df.department.asc(),df.state.desc()).show(truncate=False)
df.sort(col("department").asc(),col("state").desc()).show(truncate=False)
df.orderBy(col("department").asc(),col("state").desc()).show(truncate=False)

# Sampling: sample(withReplacement, fraction, seed=None) -- with replacement?, sampling fraction, seed
# This randomly samples a subset from a large dataset
df.sample(True, 0.2, 2022)
# Stratified sampling: sampleBy(col, fractions, seed=None)
df2.sampleBy("key", {0: 0.1, 1: 0.2}, 0).collect()

3.2.3.2 Filling and pivoting

To fill missing values there are two functions, and the result is the same:

df.fillna(value, subset=None)   # e.g. df.fillna(value="")
df.na.fill(value, subset=None)  # e.g. df.na.fill(value=0, subset=["population"])

df.na.fill("unknown",["city"]).na.fill("",["type"]).show()

The pivot function turns the values of one column into multiple columns and then performs aggregate statistics, producing a pivot table.

An example helps here. Suppose the DataFrame has columns Product, Country, and Amount, and I want to compute, for each product, the total amount produced in each country: group by Product, then pivot the Country column and aggregate Amount.

df.groupBy('Product').pivot('Country').sum('Amount').show()
# to see, for each country, the total value produced of each product
df.groupBy('Country').pivot('Product').sum('Amount').show()

The result is a pivot table with one column per country holding the summed Amount.
So what about reversing this operation? You can use an unpivot, implemented here with SQL's stack() expression, which stacks the country columns back up row by row; you can choose which country columns you want:

from pyspark.sql.functions import expr
# pivotDF is assumed to be the pivoted DataFrame produced by the groupBy().pivot() above
unpivotExpr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"
unPivotDF = pivotDF.select("Product", expr(unpivotExpr)) \
    .where("Total is not null")
unPivotDF.show(truncate=False)

The result stacks the country columns back into (Country, Total) rows.

3.2.3.3 The map and flatMap paradigms

If the DataFrame is converted to an RDD, the commonly used row operations are still map and flatMap. Here are two paradigms based on an actual scenario.

Scenario: suppose I have a DataFrame in which each record is a time period during which a user was using an app on their phone, with a start and end time. The granularity of these periods feels too coarse, and I want to split them into 1-minute segments so that later they can be joined with other tables to get more information. How do I do it?

Here is the first paradigm, flatMap:

df = spark.createDataFrame([
    ('zhangsan', 'kuaishou', 198379875385798, 9989842198934),
    ('lisi', 'douyin', 928734817893784, 397817587835),
    ('wangwu', 'bzhan', 209187901389, 189378974831)
], ['name', 'app', 'start', 'end'])   # made-up timestamps (nanoseconds); in practice end should be later than start

# Time-slicing helpers
import typing

def lower(timestamp: int, step: int):
    return timestamp // step * step

def upper(timestamp: int, step: int):
    if timestamp % step == 0:
        return timestamp
    return timestamp // step * step + step

def slice_time_range(start: int, end: int, step: int) -> typing.List[int]:
    if start == end and start % step == 0:
        return [start, start + step]
    new_start = lower(start, step)
    new_end = upper(end, step)
    return list(range(new_start, new_end+1, step))

def slice_row(row):
    _list = []
    start, end = row['start'], row['end']
    time_slice_list = slice_time_range(start, end, step=60 * 10 ** 9)
    # one output record per 1-minute slice boundary pair
    for i in range(len(time_slice_list) - 1):
        _list.append([row['name'], row['app'], time_slice_list[i], time_slice_list[i + 1]])
    return _list

slice_df = df.rdd.flatMap(slice_row).toDF(schema=df.schema)

# This is just the paradigm; if slice_row processed the row and returned exactly one record, it would be a map operation

In this way a large time segment is cut into finer-grained segments. What is the benefit of the finer granularity? It makes data processing more flexible. For example, one table is the list of time periods a user spends in a certain app, and another table records the periods spent in other apps. If you want to join the two tables and find that the time periods do not line up — the start and end times differ — what do you do? You can apply the idea of "what is long united must divide, what is long divided must unite": first cut both sides' time segments into finer grains, say 1 second, so that overlaps appear; join them; and then merge the fine-grained segments back into longer periods. So the second paradigm here is merging the slice_df above.

# Merge time segments (g operates on the pandas DataFrame of one group)
import pandas as pd

def g(df):
    sort_df = df.sort_values('start')
    result_list = []
    start_idx = 0
    while start_idx < sort_df.shape[0]:
        item_dict = dict(sort_df.iloc[start_idx])
        end_idx = start_idx + 1
        while end_idx < sort_df.shape[0] and sort_df.iloc[end_idx]['start'] - sort_df.iloc[
                end_idx - 1]['end'] <= 60 * 10**9:
            end_idx += 1
        item_dict['end'] = sort_df.iloc[end_idx - 1]['end']
        result_list.append(item_dict)
        start_idx = end_idx

    result = pd.DataFrame(result_list)
    return result

df = slice_df.groupby('name').applyInPandas(g, slice_df.schema)

3.2.4 Group Aggregation

3.2.4.1 Common grouping operations

This is another very important operation that is used all the time in big-data analysis. groupBy groups the data by some column values and aggregates each group, giving grouped-and-aggregated results that are often used for statistics.

df.groupby('cls').agg({'SepalWidth':'mean', 'SepalLength':'max'}).show()
# avg(), count(), countDistinct(), first(), kurtosis(),
# max(), mean(), min(), skewness(), stddev(), stddev_pop(),
# stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance()
from pyspark.sql.functions import col, sum, avg, max

df.groupBy("department").count()
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
      avg("salary").alias("avg_salary"), \
      sum("bonus").alias("sum_bonus"), \
      max("bonus").alias("max_bonus")) \
    .where(col("sum_bonus") >= 50000) \
    .show(truncate=False)

The above are some statistical functions. You can also combine them with functions from the built-in function library to do aggregation; a very common one is to group by some column and aggregate the values of another column into a list.

aggregation_df = df.groupby(
            'key_column1',
            'key_column2'
        ).agg(
            F.last("xxx").alias("xxx"),
            F.last("xxx").alias("xxx"),
            F.collect_list("xxx").alias("xxx"),
        )

Then there are more advanced operations: define custom functions and, after groupBy, use applyInPandas to operate on each group, similar to the paradigm above.

3.2.4.2 Detailed Analysis

Here is a brief note on how groupBy aggregation works: it requires re-shuffling the RDD.

Data with the same key is grouped onto one partition (a partition can hold several key groups); this is what the shuffle operation accomplishes. Once the data with the same key sits on one partition, the aggregation function can compute on it and return the result.

3.2.5 DataFrame merging

3.2.5.1 join connection

The join here means that two or more DataFrames are joined on certain fields; the most commonly used join types are inner and left.

# Written like this (a list of column names), columns that the two DataFrames share are kept only once
merge_df = left_df.join(right_df, ['name', 'id', ...], 'left')
# Written like the line below, both join columns are kept; the benefit is that the join column names of the two DataFrames can differ
merge_df = left_df.join(right_df, [left_df.name == right_df.name1], 'inner')

# There is also the self-join
empDF.alias("emp1").join(empDF.alias("emp2"), \
    col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
    .select(col("emp1.emp_id"),col("emp1.name"), \
      col("emp2.emp_id").alias("superior_emp_id"), \
      col("emp2.name").alias("superior_emp_name")) \
   .show(truncate=False)

3.2.5.2 Union operation

Used to merge two or more DataFrames that have the same schema; in effect it is row-wise concatenation.

unionDF = df.union(df2).distinct()
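A minimal sketch with toy frames (union matches columns by position, so both sides must share the same schema):

df_a = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df_b = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "val"])

df_a.union(df_b).distinct().show()   # union keeps duplicate rows; distinct() then removes them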

3.2.6 Custom functions

This feature is very powerful: you can define custom functions (UDFs) to process DataFrame columns. The operation I generally like is using a custom function to derive new columns from the DataFrame's existing columns — in a pandas DataFrame this would be an apply(lambda x: process(x), axis=1) kind of operation.

import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType()) 
def get_new_column(col1, col2, col3):
    # do some complex computation based on the input columns
    res = ...   # placeholder
    return res

df.withColumn("Created Name", get_new_column(F.col("col1"), F.col("col2"), F.col("col3"))).show(truncate=False)

# If you want to process all columns to generate new ones, consider map
def func1(x):
    firstName=x.firstname
    lastName=x.lastname
    name=firstName+","+lastName
    gender=x.gender.lower()
    salary=x.salary*2
    return (name,gender,salary)

df.rdd.map(lambda x: func1(x)).toDF(schema)

3.2.7 Built-in function library

The built-in function library of pyspark is also very powerful. Here are some built-in functions currently used. These functions are all in pyspark.sql.functions.

3.2.7.1 Data related

  1. when:
    similar to SQL's CASE WHEN ... THEN ... expression, or a switch / if-then-else expression

    from pyspark.sql.functions import when
    df2 = df.withColumn("new_gender", when(df.gender == "M","Male")
                                     .when(df.gender == "F","Female")
                                     .when(df.gender.isNull() ,"")
                                     .otherwise(df.gender))
    

    This can also be written directly with CASE WHEN in SQL:

    from pyspark.sql.functions import expr
    
    #Using Case When on withColumn()
    df3 = df.withColumn("new_gender", expr("CASE WHEN gender = 'M' THEN 'Male' " + 
                   "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN ''" +
                   "ELSE gender END"))
    

    In a real scenario I had a need for when: counting, per group, the number of non-zero values in given columns. This blocked me for quite a while; I finally wrote the code and record it here.

    def count_non_zero(df, features, grouping):
        return df.groupBy(grouping).agg(
            *[F.count(F.when(F.col(c) != 0, 1)).alias(f"{c}_no_zero_count") for c in features])
    

  2. expr: its parameter is a SQL-syntax string that is parsed directly as SQL, as in the operation above; another use is concatenating two string columns to get a new column

    from pyspark.sql.functions import expr
    df.withColumn("Name",expr(" col1 ||','|| col2")).show()
    df.select("increment",expr("cast(increment as string) as str_increment"))
    df.filter(expr("col1 == col2")).show()
    
  3. lit
    When adding a new column, lit assigns a constant value to the new column

    from pyspark.sql.functions import when, lit, col
    df2 = df.select(col("EmpId"),col("Salary"),lit("1").alias("lit_value1"))
    df3 = df2.withColumn("lit_value2", when(col("Salary") >=40000 & col("Salary") <= 50000,lit("100")).otherwise(lit("200")))
    
  4. split
    This splits a string column to generate a new column whose type is an array (list)

    df2 = df.select(split(col("name"),",").alias("NameArray")).drop("name")
    

    The inverse operation is concat_ws, which joins an array-typed column into a single string, similar to ','.join([...])

    df2 = df.withColumn("languagesAtSchool", concat_ws(",",col("languagesAtSchool")))
    
  5. explode
    This function, used on array-typed columns, expands each element into its own row

    +-------+-----------------------------------+
    |name   |subjects                           |
    +-------+-----------------------------------+
    |James  |[[Java, Scala, C++], [Spark, Java]]|
    |Michael|[[Spark, Java, C++], [Spark, Java]]|
    |Robert |[[CSharp, VB], [Spark, Python]]    |
    +-------+-----------------------------------+
    # To expand these subjects into rows, first use flatten() to turn the 2-D list into 1-D, then use explode()
    df.select(df.name,flatten(df.subjects)).show(truncate=False)
    df.select(df.name,explode(df.subjects)).show(truncate=False)
    +-------+--------------+
    |name   |subjects      |
    +-------+--------------+
    |James  |Java          |
    |James  |Scala         |
    |James  |C++           |
    +-------+--------------+
    

3.2.7.2 Time-related

3.2.7.3 Aggregation- and statistics-related

3.2.7.4 JSON-related

3.2.8 DataFrame writing

For writing, first let me record the partitionBy function. When writing data you can specify partitions, and partitioned data can be retrieved efficiently from data-lake storage. "Partitioning" here simply means storing the data in directories according to the specified columns, so a query only needs to scan the small range of data in the matching directory instead of the full dataset; with large data this greatly improves retrieval efficiency. For example, you can specify the partition column when writing CSV:

#partitionBy()
df.write.option("header",True) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("/tmp/zipcodes-state")

When stored this way, the data is not saved as a single file; instead, one set of files is saved per distinct state value.
When querying, if state=AZ is specified, only the data under the AZ directory is read and there is no need to scan everything. Therefore, when partitioning, the key field by which records are retrieved is used as the partition field.
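For example, a sketch of reading it back (reusing the /tmp/zipcodes-state path written above); filtering on the partition column lets Spark scan only that directory:

az_df = spark.read.option("header", True) \
             .csv("/tmp/zipcodes-state") \
             .filter("state = 'AZ'")   # only the files under .../state=AZ are scanned
az_df.show()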

You can write CSV, parquet, JSON, and so on here, but at the moment I rarely need to write data out, so I won't organize that part for now.

4. Brief summary

This article mainly organizes the data-processing operations I currently use with PySpark. There are many operations and there is no need to memorize them; the point is to have something to refer back to. Previously this knowledge was scattered with no system, and every time I needed something I had to search for it again, which wasted a lot of time — so I simply took a weekend to organize it all in one place, and I will keep adding to it as I encounter new things.

Source: blog.csdn.net/wuzhongqiang/article/details/127932881