centos:7.2
python2:2.7.5
python3:3.6.5
spark:2.2.0
ambari:2.6.1
hdp:2.6.4
I. Running pyspark with Python 3
1. Symlink python3 into the /usr/bin/ directory
ln -s /usr/local/python3/bin/python3 /usr/bin/
2. Edit the /usr/bin/pyspark file
Add the following line to the file:
export PYSPARK_PYTHON=python3
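To confirm which interpreter is actually picked up after the two steps above, a small check script can help. This is only a sketch: the function name interpreter_summary is hypothetical, and it reports the interpreter of the process that runs it (for example the driver when launched through spark-submit), not the executors.

```python
import os
import sys


def interpreter_summary():
    """Describe the Python interpreter running this script."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "is_python3": sys.version_info[0] >= 3,
        # None unless PYSPARK_PYTHON was exported, e.g. by /usr/bin/pyspark
        "pyspark_python": os.environ.get("PYSPARK_PYTHON"),
    }


if __name__ == "__main__":
    for key, value in sorted(interpreter_summary().items()):
        print("%s: %s" % (key, value))
```

If is_python3 prints False, the symlink or the export line is probably not being used by the process that started the script.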
II. Reading a txt file into Spark and saving it to Hive
# -*- coding: utf-8 -*-
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql import Row
if __name__ == "__main__":
    # Initialize the SparkSession. enableHiveSupport() is added so that
    # spark.sql() can create and write Hive tables.
    spark = SparkSession \
        .builder \
        .appName("TextToHive") \
        .config("spark.some.config.option", "some-value") \
        .enableHiveSupport() \
        .getOrCreate()
    sc = spark.sparkContext
    # Read the comma-separated text file from HDFS
    lines = sc.textFile("hdfs://10.250.11.52:8020/source/db/centercode/20180507")
    parts = lines.map(lambda l: l.split(","))
    centercode = parts.map(lambda p: Row(centercode=p[0], centername=p[1], qscode=p[2]))
    # Convert the RDD to a DataFrame
    centercode_temp = spark.createDataFrame(centercode)
    # Show the DataFrame contents
    centercode_temp.show()
    # Register a temporary view
    centercode_temp.createOrReplaceTempView("t_centercode")
    # Query the data
    employee_result = spark.sql("SELECT centercode, centername, qscode FROM t_centercode")
    # Create the Hive table
    spark.sql("CREATE TABLE IF NOT EXISTS oracledb.t_lnt_basic_center_code (centercode STRING, centername STRING, qscode STRING) USING hive")
    # Append
    spark.sql("INSERT INTO TABLE oracledb.t_lnt_basic_center_code SELECT centercode, centername, qscode FROM t_centercode")
    # Overwrite
    # spark.sql("INSERT OVERWRITE TABLE oracledb.t_lnt_basic_center_code SELECT centercode, centername, qscode FROM t_centercode WHERE centercode = '01'")
    # Convert the DataFrame back to an RDD
    # result = employee_result.rdd.map(lambda p: "centercode: " + p.centercode + " centername: " + p.centername + " qscode: " + p.qscode).collect()
    # Print the RDD data
    # for centercode in result:
    #     print(centercode)
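The script above indexes p[0], p[1] and p[2] directly, so a line with fewer than three fields would raise an IndexError inside the map. A minimal sketch of a more defensive parser follows. The names CenterCode and parse_line are hypothetical, and stripping whitespace and dropping malformed lines are assumptions, not something the original script does.

```python
from collections import namedtuple

# Hypothetical record type mirroring the three columns used in the script.
CenterCode = namedtuple("CenterCode", ["centercode", "centername", "qscode"])


def parse_line(line, sep=","):
    """Return a CenterCode for a well-formed line, or None otherwise."""
    fields = [field.strip() for field in line.split(sep)]
    # Reject lines with the wrong field count or an empty key column.
    if len(fields) != len(CenterCode._fields) or not fields[0]:
        return None
    return CenterCode(*fields)


if __name__ == "__main__":
    sample = ["01,Center A,Q1", "bad line", "02, Center B ,Q2"]
    rows = [row for row in map(parse_line, sample) if row is not None]
    for row in rows:
        print(row)
```

In the script, this could presumably replace the two map calls with something like lines.map(parse_line).filter(lambda r: r is not None) before calling spark.createDataFrame.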