Reading Hadoop text files into Hive with Spark

Environment
centos: 7.2
python2: 2.7.5
python3: 3.6.5
spark: 2.2.0
ambari: 2.6.1
hdp: 2.6.4


I. Make pyspark run with python3
1. Symlink python3 into /usr/bin/
ln -s /usr/local/python3/bin/python3 /usr/bin/
2. Edit the /usr/bin/pyspark launcher script
Add the following line to the file:
export PYSPARK_PYTHON=python3
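If you prefer not to edit the launcher script, the interpreter can also be selected per session or per job. A sketch under the assumption of Spark 2.x, where both the PYSPARK_PYTHON environment variable and the spark.pyspark.python property are honored (the script name below is a placeholder):

```shell
# Option A: set the interpreter for the current shell session only
export PYSPARK_PYTHON=python3
pyspark

# Option B: pass it per application when submitting a job
spark-submit \
  --conf spark.pyspark.python=python3 \
  text_to_hive.py
```

Option A affects only the shell it runs in, so it is handy for testing before committing the change to the launcher script.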


II. Read a txt file into Spark and save it to Hive


# -*- coding: utf-8 -*-
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    # Initialize the SparkSession; Hive support is required to write Hive tables
    spark = SparkSession \
        .builder \
        .appName("TextToHive") \
        .enableHiveSupport() \
        .getOrCreate()

    sc = spark.sparkContext

    # Read the comma-separated text file from HDFS
    lines = sc.textFile("hdfs://10.250.11.52:8020/source/db/centercode/20180507")
    parts = lines.map(lambda l: l.split(","))
    centercode = parts.map(lambda p: Row(centercode=p[0], centername=p[1], qscode=p[2]))

    # Convert the RDD to a DataFrame
    centercode_temp = spark.createDataFrame(centercode)

    # Show the DataFrame contents
    centercode_temp.show()

    # Register a temporary view
    centercode_temp.createOrReplaceTempView("t_centercode")
    # Query (and optionally filter) the data
    employee_result = spark.sql("SELECT centercode, centername, qscode FROM t_centercode")

    # Create the Hive table if it does not exist yet
    spark.sql("CREATE TABLE IF NOT EXISTS oracledb.t_lnt_basic_center_code "
              "(centercode STRING, centername STRING, qscode STRING) USING hive")
    # Append
    spark.sql("INSERT INTO TABLE oracledb.t_lnt_basic_center_code "
              "SELECT centercode, centername, qscode FROM t_centercode")
    # Overwrite
    # spark.sql("INSERT OVERWRITE TABLE oracledb.t_lnt_basic_center_code "
    #           "SELECT centercode, centername, qscode FROM t_centercode WHERE centercode = '01'")

    # Convert the DataFrame back to an RDD
    # result = employee_result.rdd.map(
    #     lambda p: "centercode: " + p.centercode + "  centername: " + p.centername +
    #               " qscode: " + p.qscode).collect()

    # Print the RDD contents
    # for centercode in result:
    #     print(centercode)
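The split-to-Row stage is plain string handling, so it can be checked without a Spark cluster. A minimal sketch, assuming two made-up sample lines in place of the HDFS file and plain dicts in place of pyspark Row objects:

```python
# Hypothetical sample lines standing in for the HDFS file contents
sample_lines = [
    "01,CenterA,Q1",
    "02,CenterB,Q2",
]

# Equivalent of lines.map(lambda l: l.split(","))
parts = [l.split(",") for l in sample_lines]

# Equivalent of the Row(...) mapping, using dicts for illustration
records = [{"centercode": p[0], "centername": p[1], "qscode": p[2]}
           for p in parts]

print(records[0]["centercode"])  # -> 01
print(records[1]["centername"])  # -> CenterB
```

Note that a bare split(",") assumes no field ever contains an embedded comma; for quoted CSV data, spark.read.csv would be the safer path.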




Reposted from blog.csdn.net/qq_39160721/article/details/80251024