A simple PySpark example: reading a Hive table, parsing JSON log fields, and writing back to Hive — initial cleaning of raw data

Given the following data:

32365 MOVE 1577808000000 {"goodid": 478777, "title": "商品478777", "price": "12000", 
	"shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
90339 MOVE 1577808008000 {"goodid": 998446, "title": "商品998446", "price": "12000",
	 "shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
10519 ORDER 1577808016000 {"goodid": 914583, "title": "商品914583", "price": "12000",
	"shopid": "1", "mark": "mark"} 6.0.0 android {"browsetype": "chrome", "browseversion": "82,0"}
53844 CART 1577808024000 {"goodid": 4592971, "title": "商品4592971", "price": "12000",
	 "shopid": "1", "mark": "mark"} 6.0.0 android {"appid": "123456", "appversion": "11.0.0"}

The fields are as follows:

userid int,
action string,
acttime string,
goodinfo string,
version string,
system string,
appinfo string

  • goodinfo and appinfo are the JSON-formatted fields shown in the data above; the task is to use PySpark to parse the JSON content out of them (regex extraction would also work) and write the result into a Hive table
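As a quick illustration of the regex alternative, the two embedded JSON objects can be pulled out of one raw line and parsed with Python's standard library (the line below is copied from the sample data above):

```python
import json
import re

# One raw log line: whitespace-delimited fields with two embedded
# JSON objects (goodinfo and appinfo)
line = ('32365 MOVE 1577808000000 {"goodid": 478777, "title": "商品478777", '
        '"price": "12000", "shopid": "1", "mark": "mark"} 6.0.0 android '
        '{"browsetype": "chrome", "browseversion": "82,0"}')

# A non-greedy match pulls each {...} block out separately
# (safe here because the sample JSON has no nested braces)
goodinfo, appinfo = re.findall(r"\{.*?\}", line)
print(json.loads(goodinfo)["goodid"])      # 478777
print(json.loads(appinfo)["browsetype"])   # chrome
```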

Approach analysis:

  • This task only needs to parse JSON content and copy data from one table into a new one, so PySpark is a good fit: the get_json_object function under pyspark.sql.functions handles the extraction

Pitfalls encountered:

  • Python's source encoding does not match the Hadoop platform's default, so add # -*- coding:utf-8 -*- as the first line so the file is interpreted as UTF-8
  • Configure Spark locally: install the findspark module and initialize local-mode Spark first, otherwise you get the error py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.isEncryptionEnabled does not exist in the JVM

The code is as follows:

# -*- coding:utf-8 -*-
import findspark
findspark.init()  # must run before importing pyspark so the local Spark install is found
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == '__main__':
    spark = SparkSession.builder.master("local[*]").appName("logs") \
        .config("hive.metastore.uris", "thrift://single:9083") \
        .enableHiveSupport().getOrCreate()
    df = spark.sql("select * from ods_myshops.ods_logs")
    # goodinfo fields are extracted with get_json_object; for appinfo,
    # instr(...) == 0 means "browsetype" is absent (i.e. an app log),
    # so the appid/appversion fields are taken, otherwise the browser fields
    df.withColumn("goodid", F.get_json_object("goodinfo", "$.goodid")) \
        .withColumn("title", F.get_json_object("goodinfo", "$.title")) \
        .withColumn("price", F.get_json_object("goodinfo", "$.price")) \
        .withColumn("shopid", F.get_json_object("goodinfo", "$.shopid")) \
        .withColumn("mark", F.get_json_object("goodinfo", "$.mark")) \
        .withColumn("soft",
                    F.when(F.instr(df['appinfo'], "browsetype") == 0,
                           F.get_json_object("appinfo", "$.appid"))
                    .otherwise(F.get_json_object("appinfo", "$.browsetype"))) \
        .withColumn("soft_version",
                    F.when(F.instr(df['appinfo'], "browsetype") == 0,
                           F.get_json_object("appinfo", "$.appversion"))
                    .otherwise(F.get_json_object("appinfo", "$.browseversion"))) \
        .drop("goodinfo", "appinfo") \
        .write.format("hive").mode("overwrite") \
        .saveAsTable("ods_myshops.ods_newlog")


Reprinted from blog.csdn.net/xiaoxaoyu/article/details/114714373