Writing Kudu Data into HBase with PySpark
I previously wrote a few dozen tags into Kudu. This post covers writing that Kudu tag data into HBase. The content is fairly simple and is offered for reference:
The versions I used are:
Spark: 2.2.0
Kudu: 1.7.0
HBase: 1.1.2
Dependency jars:
shc-core-1.1.2-2.2-s_2.11-SNAPSHOT-shaded.jar
kudu-spark-1.0-SNAPSHOT.jar
My goal is to write each Kudu table's user-identifier field, tag field, and timestamp field into HBase. So I loop over the tag tables, use the user identifier as the rowkey, and write the tag and timestamp into the cf column family. The catalog is therefore configured as follows:
{
    "table": {"namespace": "default", "name": "test_result"},
    "rowkey": "key",
    "columns": {
        "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
        "%s": {"cf": "cf", "col": "%s", "type": "string"},
        "ts": {"cf": "cf", "col": "ts", "type": "string"}
    }
}
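Because the two %s placeholders are filled in at runtime, it is easy to end up with a malformed catalog string. A minimal sanity check is to substitute a column name and parse the result as JSON before submitting the job (the tag column name age_level below is a made-up example):

```python
import json

# Hypothetical tag column name substituted into the catalog template
column = "age_level"

# The two %s placeholders are filled with the tag column name at runtime
catalog_template = """{
    "table": {"namespace": "default", "name": "test_result"},
    "rowkey": "key",
    "columns": {
        "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
        "%s": {"cf": "cf", "col": "%s", "type": "string"},
        "ts": {"cf": "cf", "col": "ts", "type": "string"}
    }
}"""

catalog = catalog_template % (column, column)

# If the substituted string is not valid JSON, this raises immediately,
# which is much easier to debug than a connector error inside Spark
parsed = json.loads(catalog)
print(sorted(parsed["columns"].keys()))  # ['age_level', 'rk', 'ts']
```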
The supported catalog configuration options can be found in the connector source inside shc-core-1.1.2-2.2-s_2.11-SNAPSHOT-shaded.jar.
The full code is as follows:
# -*- coding: cp936 -*-
"""
@time: 2018/09/20
"""
import os
import sys

os.environ['SPARK_HOME'] = '/usr/hdp/2.6.3.0-235/spark2'
sys.path.append("/usr/hdp/2.6.3.0-235/spark2/python")
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # getOrCreate() returns the same session every time, so build it once
    # before the loop instead of on every iteration
    spark = SparkSession.builder.appName("KUDU_HBASE").enableHiveSupport().getOrCreate()
    # Each config line: kudu table name, tag column, timestamp column,
    # plus a trailing guard column (see the note on the file format below)
    with open("./HBASE_CONFIG") as config_file:
        config_lines = config_file.readlines()
    for config_line in config_lines:
        fields = config_line.split(':')
        kudutable = fields[0]
        column = fields[1]
        column_ts = fields[2]
        # Register the Kudu table as a temporary view
        spark.read.format('org.apache.kudu.spark.kudu') \
            .option('kudu.master', "kudu_hostname-1:port,kudu_hostname-2:port,kudu_hostname-3:port") \
            .option('kudu.table', kudutable) \
            .option('kudu.faultTolerantScan', 'true') \
            .load().createOrReplaceTempView("temp")
        # Register the rowkey UDF under a per-column name so each loop
        # iteration gets a distinct temporary function
        spark.sql(
            "create temporary function label_rowkey_%s as 'com.bigdata.LableRowKeyUDF'" % column)
        df = spark.sql(
            "select label_rowkey_%s('mid',trim(id),'%s') as rk,"
            "cast(result as string) as %s,"
            "substring(modifytime,0,19) as ts "
            "from temp where id is not null and id!='' and id!='null'" % (
                column, column, column))
        # Fill the tag column name into the catalog template; split()/join()
        # strips all whitespace so the JSON becomes one compact string
        catalog = "".join("""{
            "table":{"namespace": "default", "name": "test_result"},
            "rowkey": "key",
            "columns": {
                "rk": {"cf": "rowkey", "col": "key", "type": "binary"},
                "%s": {"cf": "cf", "col": "%s", "type": "string"},
                "ts": {"cf": "cf", "col": "ts", "type": "string"}
            }
        }""".split()) % (column, column)
        # Write the DataFrame to HBase through the shc connector
        df.write.options(catalog=catalog) \
            .mode('append') \
            .format("org.apache.spark.sql.execution.datasources.hbase") \
            .option("hbaseConfiguration",
                    '{"zookeeper.znode.parent":"/hbase-unsecure","hbase.zookeeper.quorum":"hbase_hostname-1,hbase_hostname-2,hbase_hostname-3"}') \
            .option("newTable", "6") \
            .save()
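A side note on the hbaseConfiguration option: it is itself a JSON string, so one way to avoid hand-quoting mistakes is to build it with json.dumps. A minimal sketch, using the same placeholder hostnames as above:

```python
import json

# Build the hbaseConfiguration JSON passed to the shc connector instead of
# hand-writing the quoted string (hostnames below are placeholders)
hbase_conf = json.dumps({
    "zookeeper.znode.parent": "/hbase-unsecure",
    "hbase.zookeeper.quorum": "hbase_hostname-1,hbase_hostname-2,hbase_hostname-3",
})
print(hbase_conf)
```

The resulting string can then be passed directly to .option("hbaseConfiguration", hbase_conf).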
Note: when registering the temporary function I use label_rowkey_%s with a placeholder, so each loop iteration registers a temporary function under a different, dynamically generated name.
Format of the config file the code reads (the first column is the Kudu table name, the second is the tag field to store in HBase, the third is the tag timestamp field to store in HBase; the last column exists to guard against catalog-configuration errors during the write, since the read can pick up stray characters such as \n that cannot otherwise be removed):
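Parsing one such line can be sketched as follows (all field values here, including the table name user_tags, are made-up examples; the trailing "end" column is the guard field that absorbs stray characters like '\n'):

```python
# One config line: kudu table : tag column : timestamp column : guard
line = "user_tags:age_level:modifytime:end\n"

fields = line.split(':')
kudutable = fields[0]   # Kudu table to read
column = fields[1]      # tag column to write into HBase
column_ts = fields[2]   # timestamp column to write into HBase
# fields[3] (the guard column) is deliberately ignored, so the trailing
# newline never contaminates the fields that go into the catalog

print(kudutable, column, column_ts)  # user_tags age_level modifytime
```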