Pyspark_structured streaming 4

Pyspark

Note: If you find this blog helpful, don't forget to like and bookmark it. I post new content on artificial intelligence and big data every week, most of it original: Python, Java, Scala, and SQL code; CV, NLP, and recommendation systems; Spark, Flink, Kafka, HBase, Hive, Flume, and more, plus interpretations of papers from top conferences. Let's make progress together.
Today I continue the series with Pyspark_structured streaming 4.



Foreword

Continuing from last time on PySpark structured streaming, today's post is a small example that combines structured streaming with Kafka.


1. Data simulator code

1- Create a topic to hold the subsequent IoT data: search-log-topic
./kafka-topics.sh --create --zookeeper node1:2181 --topic search-log-topic --partitions 3 --replication-factor 2
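
If you prefer to create the topic from Python instead of the shell script, a minimal sketch using kafka-python's admin client looks like the following (same topic name and broker addresses as in this post; it assumes kafka-python is installed and the brokers allow topic creation):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the topic programmatically, equivalent to the kafka-topics.sh command above
admin = KafkaAdminClient(bootstrap_servers=['node1:9092', 'node2:9092', 'node3:9092'])
admin.create_topics([NewTopic(name='search-log-topic', num_partitions=3, replication_factor=2)])
admin.close()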

import json
import random
import time
import os
from kafka import KafkaProducer


# Pin the remote execution environment to avoid problems from multiple Python versions
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ["PYSPARK_PYTHON"] = "/root/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/root/anaconda3/bin/python"

# Shortcut: type "main" and press Enter (IDE live template)
if __name__ == '__main__':
    print("Simulating IoT data")

    # 1- Build a Kafka producer:
    producer = KafkaProducer(
        bootstrap_servers=['node1:9092', 'node2:9092', 'node3:9092'],
        acks='all',
        value_serializer=lambda m: json.dumps(m).encode("utf-8")
    )
    # 2- IoT device types
    deviceTypes = ["washing machine", "range hood", "air conditioner", "curtain", "light", "window",
                   "gas alarm", "water meter", "gas meter"]

    while True:
        index = random.choice(range(0, len(deviceTypes)))
        deviceID = f'device_{index}_{random.randrange(1, 20)}'
        deviceType = deviceTypes[index]
        deviceSignal = random.choice(range(10, 100))

        # Assemble the record
        print({'deviceID': deviceID, 'deviceType': deviceType, 'deviceSignal': deviceSignal,
               'time': time.strftime('%s')})

        # Send the record to Kafka
        producer.send(topic='search-log-topic',
                      value={'deviceID': deviceID, 'deviceType': deviceType, 'deviceSignal': deviceSignal,
                             'time': time.strftime('%s')})

        # Sleep for a random interval of 1-4 seconds
        time.sleep(random.choice(range(1, 5)))
Generated Kafka data:
{'deviceID': 'device_0_14', 'deviceType': 'washing machine', 'deviceSignal': 18, 'time': '1680157073'}
{'deviceID': 'device_2_8', 'deviceType': 'air conditioner', 'deviceSignal': 30, 'time': '1680157074'}
{'deviceID': 'device_0_17', 'deviceType': 'washing machine', 'deviceSignal': 84, 'time': '1680157076'}
{'deviceID': 'device_2_15', 'deviceType': 'air conditioner', 'deviceSignal': 99, 'time': '1680157078'}
{'deviceID': 'device_1_17', 'deviceType': 'range hood', 'deviceSignal': 50, 'time': '1680157081'}
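
To confirm the records are actually landing in the topic, a quick sanity check can be run on the side with kafka-python's KafkaConsumer (a sketch, using the same brokers and topic as above):

import json
from kafka import KafkaConsumer

# Read the simulated records back from the topic and print them
consumer = KafkaConsumer(
    'search-log-topic',
    bootstrap_servers=['node1:9092', 'node2:9092', 'node3:9092'],
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    print(message.value)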

2. Requirements description and code implementation

Requirement: for devices with signal strength > 30, compute the device count and the average signal strength per device type; filter first, then aggregate.
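
Before wiring this to the Kafka stream, the filter-then-aggregate logic can be checked on a small static DataFrame. This is only a sketch with made-up rows, not part of the streaming job:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('agg_check').master('local[1]').getOrCreate()

# A few made-up device records for a quick batch check
sample = spark.createDataFrame([
    ('device_0_1', 'washing machine', 18),
    ('device_0_2', 'washing machine', 84),
    ('device_2_3', 'air conditioner', 99),
], ['deviceID', 'deviceType', 'deviceSignal'])

# Filter first (deviceSignal > 30), then aggregate per device type
sample.where(F.col('deviceSignal') > 30) \
    .groupBy('deviceType') \
    .agg(F.count('deviceID').alias('device_cnt'),
         F.round(F.avg('deviceSignal'), 2).alias('deviceSignal_avg')) \
    .show()

The full streaming implementation against Kafka follows.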

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import os

# Pin the remote environment to keep it consistent
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("综合案例: 物联网案例实现")

    # 1- Create the SparkSession object
    spark = SparkSession.builder \
        .appName('file_source') \
        .master('local[1]') \
        .config('spark.sql.shuffle.partitions', 4) \
        .getOrCreate()

    # 2- Read message data from Kafka
    df = spark.readStream \
        .format('kafka') \
        .option('kafka.bootstrap.servers', 'node1:9092,node2:9092,node3:9092') \
        .option('subscribe', 'search-log-topic') \
        .option('startingOffsets', 'earliest') \
        .load()

    # 3- Process the data
    # Goal: for devices with signal strength > 30, count and average signal strength per device type; filter first, then aggregate
    # Sample record: {'deviceID': 'device_4_4', 'deviceType': 'light', 'deviceSignal': 20, 'time': '1677243108'}
    df = df.selectExpr('CAST(value AS STRING)')

    # How should this be done?
    # Each field of the JSON string needs to be extracted into its own column
    # This is commonly called JSON flattening
    # Relevant functions: get_json_object(), json_tuple()
    # df.createTempView('t1')

    # SQL
    # df = spark.sql("""
    #     select
    #         get_json_object(value,'$.deviceID')  as deviceID,
    #         get_json_object(value,'$.deviceType') as deviceType,
    #         get_json_object(value,'$.deviceSignal') as deviceSignal,
    #         get_json_object(value,'$.time') as time
    #     from  t1
    # """)
    # df = spark.sql("""
    #     select
    #         json_tuple(value,'deviceID','deviceType','deviceSignal','time') as (deviceID,deviceType,deviceSignal,time)
    #     from  t1
    # """)

    # DSL
    # df = df.select(
    #     F.get_json_object('value', '$.deviceID').alias('deviceID'),
    #     F.get_json_object('value','$.deviceType').alias('deviceType'),
    #     F.get_json_object('value','$.deviceSignal').alias('deviceSignal'),
    #     F.get_json_object('value','$.time').alias('time')
    # )

    df = df.select(
        F.json_tuple('value', 'deviceID', 'deviceType', 'deviceSignal', 'time').alias('deviceID', 'deviceType',
                                                                                      'deviceSignal', 'time')
    )

    # Filter devices with signal strength > 30, then count and average per device type
    df = df.where(df['deviceSignal'] > 30).groupBy('deviceType').agg(
        F.count('deviceID').alias('device_cnt'),
        F.round(F.avg('deviceSignal'), 2).alias('deviceSignal_avg')
    )
    # 4- Print the result to the console
    df.writeStream.format('console').outputMode('complete').start().awaitTermination()
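
As an optional variation (a sketch, not part of the original example), the aggregated result can be written back to Kafka instead of the console. The output topic 'device-agg-topic' and the checkpoint path below are illustrative names, and df and F refer to the objects in the script above:

# Optional sink: serialize each aggregate row as JSON and write it back to Kafka
query = df.select(
    F.to_json(F.struct('deviceType', 'device_cnt', 'deviceSignal_avg')).alias('value')
).writeStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'node1:9092,node2:9092,node3:9092') \
    .option('topic', 'device-agg-topic') \
    .option('checkpointLocation', '/tmp/device_agg_ckpt') \
    .outputMode('complete') \
    .start()
query.awaitTermination()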


Summary

Today I shared a small IoT example that combines PySpark structured streaming with Kafka.

Origin blog.csdn.net/weixin_53280379/article/details/129856898