Writing KUDU Data to HBase with PySpark

I previously wrote a few dozen label tables into KUDU. This post covers writing those KUDU label contents into HBase. It is fairly simple and is offered for reference.

The versions I used are as follows:

Spark:2.2.0

KUDU:1.7.0

HBase:1.1.2

Dependency jars:

shc-core-1.1.2-2.2-s_2.11-SNAPSHOT-shaded.jar
kudu-spark-1.0-SNAPSHOT.jar
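
Both jars need to be on the Spark classpath before the job runs. A minimal sketch of one way to attach them when building the session (the jar paths are assumptions; passing them with --jars at submit time works just as well):

from pyspark.sql import SparkSession

# Hypothetical local paths to the two dependency jars
jars = ",".join([
    "/path/to/shc-core-1.1.2-2.2-s_2.11-SNAPSHOT-shaded.jar",
    "/path/to/kudu-spark-1.0-SNAPSHOT.jar",
])

spark = SparkSession.builder \
    .appName("KUDU_HBASE") \
    .config("spark.jars", jars) \
    .enableHiveSupport() \
    .getOrCreate()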

My goal is to write the user-identifier field, the label field, and the timestamp field from multiple KUDU tables into HBase. The script therefore loops over the label tables, uses the user identifier as the rowkey, and writes the label and its timestamp into the cf column family, so the catalog is configured as follows (a filled-in sketch follows the JSON):

{
    "table": {"namespace": "default", "name": "test_result"},
    "rowkey": "key",
    "columns": {
        "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
        "%s": {"cf": "cf", "col": "%s", "type": "string"},
        "ts": {"cf": "cf", "col": "ts", "type": "string"}
    }
}
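
For example, with the AGERANGES label from the sample configuration shown further below, the two %s placeholders are filled with the column name. A minimal sketch of the substitution (the label name is just an example):

catalog_template = """{
    "table": {"namespace": "default", "name": "test_result"},
    "rowkey": "key",
    "columns": {
        "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
        "%s": {"cf": "cf", "col": "%s", "type": "string"},
        "ts": {"cf": "cf", "col": "ts", "type": "string"}
    }
}"""

# The HBase column holding the label value is named after the KUDU label column.
print(catalog_template % ("AGERANGES", "AGERANGES"))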

The definition of this catalog format can be found inside shc-core-1.1.2-2.2-s_2.11-SNAPSHOT-shaded.jar.

The full code is as follows:

# -*- coding: cp936 -*-
"""
@time: 2018/09/20
"""
import os
import sys

os.environ['SPARK_HOME'] = '/usr/hdp/2.6.3.0-235/spark2'
sys.path.append("/usr/hdp/2.6.3.0-235/spark2/python")
from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = SparkSession.builder.appName("KUDU_HBASE").enableHiveSupport().getOrCreate()

    # Each line of HBASE_CONFIG describes one KUDU label table:
    # <kudu table>:<label column>:<label timestamp column>:<guard value>
    with open("./HBASE_CONFIG") as config_file:
        lines = config_file.readlines()

    for line in lines:
        fields = line.split(':')
        kudutable = fields[0]
        column = fields[1]
        column_ts = fields[2]  # read but not used below; ts comes from modifytime

        # Read the KUDU label table and expose it as a temporary view.
        spark.read.format('org.apache.kudu.spark.kudu') \
            .option('kudu.master', "kudu_hostname-1:port,kudu_hostname-2:port,kudu_hostname-3:port") \
            .option('kudu.table', "%s" % kudutable) \
            .option('kudu.faultTolerantScan', 'true') \
            .load().createOrReplaceTempView("temp")

        # Register the rowkey-building UDF; the function name is dynamic
        # (label_rowkey_<column>), so every iteration registers its own.
        spark.sql(
            "create temporary function label_rowkey_%s as 'com.bigdata.LableRowKeyUDF'" % column)

        # rk: HBase rowkey, <column>: label value, ts: label timestamp.
        df = spark.sql(
            "select label_rowkey_%s('mid',trim(id),'%s') as rk,cast(result as string) as %s,substring(modifytime,0,19) as ts from temp where id is not null and id!='' and id!='null'" % (
                column, column, column))

        # Build the shc catalog: split()/join() strips all whitespace, then the
        # two %s placeholders are replaced with the current label column name.
        catalog = "".join("""{
                            "table":{"namespace": "default", "name": "test_result"},
                            "rowkey": "key",
                            "columns": {
                                "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
                                "%s": {"cf": "cf", "col": "%s", "type": "string"},
                                "ts": {"cf": "cf", "col": "ts", "type": "string"}
                }
                }""".split()) % (column, column)

        # Append the DataFrame into the test_result HBase table through shc.
        df.write.options(catalog=catalog) \
            .mode('append') \
            .format("org.apache.spark.sql.execution.datasources.hbase") \
            .option("hbaseConfiguration",
                    '{"zookeeper.znode.parent":"/hbase-unsecure","hbase.zookeeper.quorum":"hbase_hostname-1,hbase_hostname-2,hbase_hostname-3"}') \
            .option("newTable", "6") \
            .save()

Note: the temporary function is registered as label_rowkey_%s, using a placeholder, so the function name registered in each loop iteration is different, i.e. dynamic.
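
For example, the first line of the configuration file shown below makes the loop register a function named label_rowkey_AGERANGES and build the matching query. A minimal sketch of the strings produced in that iteration:

column = "AGERANGES"

# Temporary function registered in this iteration
func_sql = "create temporary function label_rowkey_%s as 'com.bigdata.LableRowKeyUDF'" % column

# Query built in this iteration: rowkey, label value and label timestamp
select_sql = ("select label_rowkey_%s('mid',trim(id),'%s') as rk,"
              "cast(result as string) as %s,"
              "substring(modifytime,0,19) as ts from temp "
              "where id is not null and id!='' and id!='null'") % (column, column, column)

print(func_sql)
print(select_sql)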

Format of the configuration file the code reads (the first column is the KUDU table name, the second is the label field to store in HBase, the third is the label timestamp field to store in HBase, and the last column exists only to guard against catalog configuration errors during the write to HBase, since the read would otherwise pick up trailing characters such as \ that cannot be removed; a parsing sketch follows the sample lines):

TEST_KUDU.TEST_AGERANGES:AGERANGES:AGERANGES_TS:1
TEST_KUDU.TEST_BED1ROOM_PROB:BED1ROOM_PROB:BED1ROOM_PROB_TS:2
TEST_KUDU.TEST_JIANYE_ONEYEARS:JIANYE_ONEYEARS:JIANYE_ONEYEARS_TS:3
TEST_KUDU.TEST_BED2ROOM_PROB:BED2ROOM_PROB:BED2ROOM_PROB_TS:4
TEST_KUDU.TEST_CITY_ONESELF:CITY_ONESELF:CITY_ONESELF_TS:5
TEST_KUDU.TEST_CREVENUEALL:CREVENUEALL:CREVENUEALL_TS:6
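
A minimal sketch of how one of these lines is split (only the first three fields are assigned by the script; the trailing number absorbs the end-of-line characters mentioned above):

line = "TEST_KUDU.TEST_AGERANGES:AGERANGES:AGERANGES_TS:1\n"

fields = line.split(':')
kudutable, column, column_ts = fields[0], fields[1], fields[2]
# fields[3] ('1\n') is never used; it only keeps the newline out of the real fields
print(kudutable, column, column_ts)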

Reprinted from blog.csdn.net/qq_37050993/article/details/83342425