For PySpark data engineering, this article is enough to get you started.

Big data prerequisite knowledge

  • hadoop: a big data computing framework that uses HDFS for storage; a cluster is made up of many cheap machines
  • hive: provides rich SQL query capabilities for analyzing data stored in the Hadoop distributed file system; structured data files can be mapped to database tables and queried with full SQL
  • mapreduce: a computing task is split into multiple parts, distributed to the machines in the cluster, computed in parallel, and the results are then merged

1. Background introduction

Spark, like Hadoop, is a distributed data computing framework, but Hadoop's computation is based on HDFS file storage while Spark keeps data in memory, so Spark is faster. In practice, the work mostly comes down to turning data into tables and operating on them.

1.1 Main components

  • spark core
    • RDD: Resilient Distributed Dataset, a read-only collection of partitioned records; it can only be created by deterministic operations on data sets in stable physical storage or on other existing RDDs
    • DAG: directed acyclic graph, i.e. starting from a node you cannot follow edges back to that node (no cycles); in Spark, the relationships between RDDs are modeled as a DAG to describe their dependencies
  • spark sql: lets you write the computation logic in SQL

Here are the terms you need to know for working with the Python API

  • SparkContext: the entry point of any Spark functionality, mainly used to create and manipulate RDDs
  • SparkSession: before Spark 2.0, different functionality required different contexts; SparkSession now combines them, so all of their functionality can be used through it (see the sketch after this list)
    • to create and operate on RDDs, use SparkContext
    • for Streaming, use StreamingContext
    • for SQL, use SQLContext
    • for Hive, use HiveContext
  • DataFrame: essentially a table that stores data in two-dimensional form together with a schema (data structure information); it performs better than an RDD and carries structure information
  • Dataset: more detailed than a DataFrame, since the type of each field is known
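
As a minimal sketch of how these entry points relate (the app name and sample rows below are made up for illustration):

 from pyspark.sql import SparkSession

 # SparkSession is the unified entry point
 spark = SparkSession.builder.appName("entry-points").getOrCreate()

 # the underlying SparkContext is still available for RDD work
 sc = spark.sparkContext
 rdd = sc.parallelize([1, 2, 3])

 # the session itself creates DataFrames and runs SQL
 df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])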

1.2 Environment setup

  • Windows operating system
  • Python 3.7
  • PySpark library (a quick install check is sketched below)
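
Assuming PySpark was installed with pip (package name pyspark, not shown in the original), a minimal check that the library is importable:

 # minimal sanity check that the pyspark package is available
 # (assumes it was installed with: pip install pyspark)
 import pyspark
 print(pyspark.__version__)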

2. Development syntax

The basic data records look like this:

{"database":"gateway_db","xid":1727259328,"data":{"point_id":"YC13002","update_time":"2020-12-27 22:00:30","ex":null,"create_time":"2020-12-27 22:00:30","point_value":3.78,"equipment":"C1","push_time":"2020-12-27 22:00:22","client_id":"d35e0c87-ed79-45ac-bf00-1e3abc35e9e3","quality":0},"old":{"update_time":"2020-12-27 21:59:30","create_time":"2020-12-27 21:59:30","point_value":3.93,"push_time":"2020-12-27 21:59:22"},"commit":true,"position":"mysql-bin.000614:445978746","type":"update","server_id":1228688365,"table":"real_msg","ts":1609077630}
2.1 Read data

Data can be read from OSS, from Kafka, or from local files.

 from pyspark.sql import SparkSession

 spark = SparkSession.builder \
         .appName("test") \
         .config("spark.some.config.option", "some-value") \
         .getOrCreate()

 # read a JSON file; each record becomes a row of the DataFrame
 df = spark.read.json("F://prod-zhongda+0+0024526281.json")

 # reading CSV works the same way
 df = spark.read.csv("F://prod-zhongda+0+0024526281.csv")
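
Since Kafka is also mentioned as a source, here is a hedged sketch of what reading a topic with Structured Streaming could look like; the broker address and topic name are placeholders, and the spark-sql-kafka connector package has to be available:

 # sketch only: read the same kind of records from a Kafka topic
 # (broker and topic are placeholders, not from the original article)
 kafka_df = spark.readStream \
     .format("kafka") \
     .option("kafka.bootstrap.servers", "localhost:9092") \
     .option("subscribe", "some-topic") \
     .load()
 # the payload arrives in the binary "value" column and is usually cast to string
 raw = kafka_df.selectExpr("CAST(value AS STRING) AS json_str")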

2.2 Data filtering

 # filtering with filter
 # option 1: write the expression as a string
 df.filter(" xid > 33333333 ")
 # option 2: write the expression with column objects
 df.filter(df.xid > 33333)

 # filtering with where (an alias for filter)
 df.where(df.xid > 3333)
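
Conditions can also be combined; a small sketch assuming the top-level fields xid and type from the sample record:

 # combine conditions with & / | and wrap each one in parentheses
 df.filter((df.xid > 33333) & (df["type"] == "update"))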


 # filtering with a UDF
 from pyspark.sql.functions import udf
 from pyspark.sql.types import LongType
 # define a custom function and register it so it can also be used in SQL
 test_method = udf(lambda x: (x + 1), LongType())
 spark.udf.register("test_method", test_method)
 # a plain Python function can also be registered directly
 spark.udf.register("test_method2", lambda x: x + 1, LongType())
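
A short sketch of actually applying the UDF, both as a column expression and via the registered SQL name:

 # use the UDF as a column expression
 df.select(test_method(df.xid).alias("xid_plus_one"))
 # the registered name also works inside SQL expressions
 df.selectExpr("test_method(xid) AS xid_plus_one")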

2.3 Data calculation/processing with the API

For this kind of calculation and processing, first make sure whether you are working with a DataFrame or an RDD.

 # print the schema of the DataFrame
 df.printSchema()

 # collect the DataFrame into a list of Row objects
 dfList = df.collect()
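
A small sketch of moving between the two representations:

 # a DataFrame exposes its underlying RDD of Row objects
 rdd = df.rdd
 # and an RDD of Rows can be turned back into a DataFrame
 df_again = spark.createDataFrame(rdd)
 # show() is a convenient way to peek at a DataFrame without collecting it all
 df.show(5)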

2.4 Calculation and processing with SQL

First register the DataFrame as a temporary table in memory, then run SQL against that table. The table is dropped when the session is closed.

 # create a temporary view
 df.createOrReplaceTempView("data_handler")
 # query the temporary view with SQL
 df = spark.sql("select * from data_handler")

 # use select to process/filter the data
 # you can look at individual columns
 df2 = df.select("xid", "data")

 # grouping and aggregation usually go together
 # group
 group = df.groupby("xid")
 # aggregate
 # a function can be given directly as {"column name": "function name"}
 df3 = group.agg({"ts": "max"})
 # functions from pyspark.sql.functions can also be used, and several columns can be returned
 # (alias keeps the aggregated column name as "ts" so the join below can refer to it)
 from pyspark.sql import functions as F
 df3 = group.agg(F.max("ts").alias("ts"), F.min("ts").alias("min_ts"))

 # join multiple tables on the columns in the expression
 df4 = df3.join(df, df["ts"] == df3["ts"])
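
Since the interesting fields of the sample record are nested under data, here is a hedged sketch of flattening them and aggregating per equipment; the column names follow the sample JSON, but the exact schema depends on what spark.read.json infers:

 # flatten the nested "data" struct and aggregate per equipment
 points = df.select(
     F.col("data.equipment").alias("equipment"),
     F.col("data.point_value").alias("point_value"),
     "ts",
 )
 latest = points.groupby("equipment").agg(
     F.max("ts").alias("latest_ts"),
     F.avg("point_value").alias("avg_value"),
 )
 latest.show()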


Differences and commonalities of RDD, DataFrame, and Dataset

  • Common points
    • All three are lazily evaluated: nothing is computed until an action such as foreach is encountered (see the sketch after this list)
    • All three have the concept of partitions, and they share many functions such as filter, sorting, and so on
    • All three cache automatically depending on Spark's memory situation, so even with a large amount of data there is less need to worry about memory overflow
  • Differences: the level of control over the data differs, becoming finer-grained step by step from RDD to DataFrame to Dataset
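
A tiny sketch of the lazy-evaluation point: transformations only build up the execution plan, and an action triggers the actual computation:

 # transformations are lazy: this line does not touch the data yet
 recent = df.filter(df.ts > 1609000000)
 # actions such as count(), collect() or foreach() trigger the computation
 print(recent.count())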

Origin blog.csdn.net/hgdzw/article/details/112861114