Hands-on case with Hive on Spark: data warehouse ETL for medical big data

2.6.1 Choosing the approach

Summary:
1) Spark is positioned as an in-memory computing framework: distributed computation with RDDs, real-time processing with Spark Streaming, structured queries with Spark SQL, and data mining with Spark ML.
2) Compared with the Hadoop ecosystem: distributed storage with HDFS, the Hive data warehouse (metastore plus data stored on HDFS), distributed resource scheduling with YARN, and the NoSQL database HBase.
3) Combined approach: Spark SQL handles multi-source IO ingestion, RDD/DataFrame operations do the cleaning, transformation, and extraction, Spark SQL loads the structured results, Hive provides the metastore and data warehouse operations, and the data ultimately lives on HDFS (a minimal sketch follows).
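The shape of point 3 can be sketched end to end in PySpark. This is an illustration only; the JDBC URL, the column amount, and the names src_table and stat_db.stat_table are placeholders, not values from this case:

    from pyspark.sql import SparkSession

    # Spark SQL with Hive support: Hive supplies the metastore, HDFS stores the table data
    spark = SparkSession.builder \
        .appName("etl-sketch") \
        .enableHiveSupport() \
        .getOrCreate()

    # Spark SQL handles multi-source IO (here a JDBC source, placeholder connection values)
    df = spark.read.format("jdbc") \
        .option("url", "jdbc:mysql://db-host:3306/src_db") \
        .option("dbtable", "src_table") \
        .option("user", "user") \
        .option("password", "pwd") \
        .load()

    # DataFrame operations do the cleaning and transformation
    df = df.na.drop().filter(df.amount > 0)

    # Spark SQL loads the structured result into a Hive-managed table on HDFS
    df.write.mode("overwrite").saveAsTable("stat_db.stat_table")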

2.6.2 Hive metastore deployment modes

1) Embedded mode (metadata is stored in the embedded Derby database; only one session can connect, and attempting multiple concurrent sessions raises an error)
2) Local mode (MySQL is installed locally to replace Derby as the metadata store)
3) Remote mode (MySQL is installed on a remote host to replace Derby as the metadata store)

2.6.3 Remote-mode cluster installation

1) Download: apache-hive-2.3.4

2) hive-site.xml configuration on the metastore server side:

<configuration>

<!-- data files of the Hive warehouse -->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://hadoop01:9000/hadoop/hive/warehouse</value>
</property>

<!-- metastore service: Hive connects to MySQL, and MySQL stores the Hive metadata -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.0.252:3306/hivemeta?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>

<property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value></value>
</property>

</configuration>

3) Hive client configuration:

<configuration>

<!-- data files of the Hive warehouse -->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://hadoop01:9000/hadoop/hive/warehouse</value>
</property>

<property> 
	<name>hive.metastore.local</name> 
	<value>false</value> 
</property> 

<property> 
	<name>hive.metastore.uris</name>
	<value>thrift://192.168.0.101:9083</value> 
</property>

</configuration>

[Hive on Spark only needs the metastore server and not the Hive client, since Spark already integrates HiveContext into SQLContext?]
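Regarding the bracketed question: with the remote metastore service running, a Spark application can indeed work against the warehouse without a local Hive client, because SparkSession (which absorbed HiveContext/SQLContext) only needs the metastore URI and the warehouse directory. A minimal connectivity check, reusing the addresses from this section:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("metastore-check") \
        .config("hive.metastore.uris", "thrift://192.168.0.101:9083") \
        .config("spark.sql.warehouse.dir", "hdfs://hadoop01:9000/hadoop/hive/warehouse") \
        .enableHiveSupport() \
        .getOrCreate()

    # if the thrift connection works, the databases registered in the remote metastore are listed
    spark.sql("SHOW DATABASES").show()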

4) Hive environment variables: hive-env.sh (needed on both the server and the client)

JAVA_HOME=/data/software/jdk1.8.0_191

HADOOP_HOME=/data/app/hadoop/hadoop-2.7.7
HADOOP_CONF_DIR=/data/app/hadoop/hadoop-2.7.7/etc/hadoop

HIVE_HOME=/data/app/hadoop/apache-hive-2.3.4
HIVE_CONF_DIR=/data/app/hadoop/apache-hive-2.3.4/conf

Note: the final /etc/profile:
export JAVA_HOME=/data/software/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export HADOOP_HOME=/data/app/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/data/app/hadoop/spark-2.4.0-bin-hadoop2.7
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${YARN_HOME}/etc/hadoop
export HIVE_HOME=/data/app/hadoop/apache-hive-2.3.4
export HIVE_CONF_DIR=/data/app/hadoop/apache-hive-2.3.4/conf

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HIVE_HOME/bin

[/etc/profile holds the global configuration; the individual env.sh files can export these variables themselves or ...]

5) Hive on Spark

Copy the client hive-site.xml into the conf/ directory on the Spark nodes.

6) Start the data warehouse with Spark SQL

./spark-2.4.0-bin-hadoop2.7/bin/spark-sql --master spark://hadoop01:7077 --conf spark.sql.warehouse.dir=hdfs://hadoop01:9000/hadoop/hive/warehouse

2.6.4 Data warehouse ETL
1) Extract

sql_str = "(select a.*, ROWNUM rownum__rn from his_test.op_patient_cost a) b"
ct = 58475050  # upper bound for the rownum__rn partition column (total row count of the source table)

# extract
# Oracle-to-Hive data type mapping used during extraction:
#   Oracle   -----  Hive
#   VARCHAR2        string
#   VARCHAR         string
#   NUMBER          decimal/int
#   DATE            timestamp
#   DATETIME        timestamp
#   CLOB            string
#   BLOB            binary
#   OTHER           string
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@192.168.0.251:1521:orcl") \
    .option("dbtable", sql_str) \
    .option("user", "****") \
    .option("password", "***") \
    .option("fetchSize", 100) \
    .option("partitionColumn", "rownum__rn") \
    .option("numPartitions", 2) \
    .option("lowerBound", 0) \
    .option("upperBound", ct) \
    .load() \
    .drop("rownum__rn")
# df.printSchema()
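The extract snippet above, and the transform/load snippets below, assume an existing SparkSession named spark plus the usual imports. A sketch of that setup follows; the application name and the Oracle driver jar path are assumptions (the master URL is the one used in step 6), and ct can optionally be derived at runtime instead of being hard-coded:

    from pyspark.sql import SparkSession, functions as F, Window
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder \
        .appName("outpatient-etl") \
        .master("spark://hadoop01:7077") \
        .config("spark.jars", "/path/to/ojdbc8.jar") \
        .enableHiveSupport() \
        .getOrCreate()

    # optionally derive the JDBC upperBound (row count) instead of hard-coding ct
    cnt_df = spark.read.format("jdbc") \
        .option("url", "jdbc:oracle:thin:@192.168.0.251:1521:orcl") \
        .option("dbtable", "(select count(*) cnt from his_test.op_patient_cost) c") \
        .option("user", "****") \
        .option("password", "***") \
        .load()
    ct = int(cnt_df.collect()[0][0])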

2) Transform

  • Column selection and column-name handling

     # Convert the Oracle column names to lowercase. A DataFrame is a schema-aware distributed dataset,
     # so avoid iterating over the data with plain Python; the schema (column names) lives on the driver,
     # so looping over df.columns is cheap.
     # Alternative: df[name].alias(name.lower()) - alias returns a Column and is used together with select.
     names = df.columns
     for name in names:
         df = df.withColumnRenamed(name, name.lower())
    
  • Null handling

     df = df.na.fill("")
     df = df.na.fill({"is_insurup": 0, "tsort": 0, "comb_cost_count": 0, "order_type": 0,
                      "cost_money": 0.0, "prefer_money" : 0.0, "tstatus": 0})
    
  • Drop columns not used in the analysis

     df = df.drop("group_no", "comb_random", "remark", "is_corr", "is_insurup", "receipt_print_flag", "send_mtl_flag")
    
  • Data type conversion

     df = df.withColumn('order_type', df.order_type.cast(IntegerType()))
     df = df.withColumn('cost_count', df.cost_count.cast(IntegerType()))
     df = df.withColumn('is_comb', df.is_comb.cast(IntegerType()))
     # dfc = dfc.withColumn('is_insurup', dft.is_insurup.cast(IntegerType()))
     df = df.withColumn('tsort', df.tsort.cast(IntegerType()))
     df = df.withColumn('tstatus', df.tstatus.cast(IntegerType()))
     df = df.withColumn('cost_tstatus', df.cost_tstatus.cast(IntegerType()))
     df = df.withColumn('payment_tstatus', df.payment_tstatus.cast(IntegerType()))
     df = df.withColumn('is_append', df.is_append.cast(IntegerType()))
     # dfc = dfc.withColumn('send_mtl_flag', dft.send_mtl_flag.cast(IntegerType()))
     df = df.withColumn('comb_cost_count', df.comb_cost_count.cast(IntegerType()))
     # dfc = dfc.withColumn('is_corr', dft.is_corr.cast(IntegerType()))
     # df.printSchema()
    
  • Filtering

     # Boolean columns can be combined with & or |, but each condition must be wrapped in parentheses to fix the evaluation order
     df = df.filter((df.tstatus == 2) & (df.cost_money > 0.0))
    
  • Aggregation

     # Aggregation: group by the cost/order business type
     df_order = df.groupBy("order_type").agg(F.sum(df.cost_money - df.prefer_money), F.max(df.cost_money - df.prefer_money))
     df_order = df_order.withColumnRenamed("sum((cost_money - prefer_money))", "sum_group_order")
     df_order = df_order.withColumnRenamed("max((cost_money - prefer_money))", "max_group_order")
     df_order.show(200, truncate=False)
    
     # Aggregation: group by the treating department
     df_depart = df.groupBy("op_depart_code").agg(F.sum(df.cost_money - df.prefer_money), F.max(df.cost_money - df.prefer_money))
     df_depart = df_depart.withColumnRenamed("sum((cost_money - prefer_money))", "sum_group_depart")
     df_depart = df_depart.withColumnRenamed("max((cost_money - prefer_money))", "max_group_depart")
     df_depart.show(200, truncate=False)
    
  • Window aggregation (a top-N filter using this rank is sketched right after this item)

     class_window = Window.partitionBy("order_type").orderBy(F.desc("cost_money"))  # order by cost descending
     class_rank = F.rank().over(class_window)
     # class_row_number = F.row_number().over(class_window)  # the usual window-function pattern: fn(...).over(window)
     df_order_rank = df.withColumn("gby_ordertype_oby_desc", class_rank)
     df_order_rank.show(200, truncate=False)
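As noted in the window-aggregation item, the rank column makes it easy to keep only the most expensive records per business type. A small sketch; the cutoff of 3 is an arbitrary choice for illustration:

     # keep only the top-3 cost records within each order_type, using the rank computed above
     df_top3 = df_order_rank.filter(F.col("gby_ordertype_oby_desc") <= 3)
     df_top3.show(50, truncate=False)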
    

3) Load

    # load: write the aggregated results into the Hive data warehouse
    # spark.sql("CREATE DATABASE outpatient_stat")
    spark.sql("use outpatient_stat")

    # df_order.createOrReplaceTempView("outpatient_cost_by_order")
    # df_depart.createOrReplaceTempView("outpatient_cost_by_depart")
    # spark.sql("select * from outpatient_cost_by_order").show(200, truncate=False)

    # save the results into the data warehouse
    df_order.write.mode("overwrite").format("parquet").saveAsTable(
        "outpatient_stat.outpatient_cost_by_order")

    df_depart.write.mode("overwrite").format("parquet").saveAsTable(
        "outpatient_stat.outpatient_cost_by_depart")

    df_order_rank.write.mode("overwrite").format("parquet").saveAsTable(
        "outpatient_stat.outpatient_cost_by_rank")

    spark.sql("SELECT * FROM outpatient_stat.outpatient_cost_by_order").show(200, truncate=False)

    spark.sql("SELECT * FROM outpatient_stat.outpatient_cost_by_depart").show(200, truncate=False)

    spark.sql("SELECT * FROM outpatient_stat.outpatient_cost_by_rank").show(200, truncate=False)
