Recently, while using PySpark to read data directly from Hive, I often ran into several problems:
1. java.io.IOException: Not a file — the file actually exists, but the default HDFS path resolution goes wrong; you need to pass --files and --conf to spark-submit.
2. pyspark.sql.utils.AnalysisException: 'Table or view not found' — the table does exist in Hive, yet Spark reports that it cannot be found.
3. org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'xxxxx' not found — the database cannot be found because enableHiveSupport() was not configured. For example, the session should be created as: spark = SparkSession.builder.master("local").appName("SparkOnHive").enableHiveSupport().getOrCreate()
The cause of these problems, at least in my testing: the Hive configuration was not passed along when the spark-submit job was submitted, i.e. the required parameters were not set.
# The following code passed testing
import os
import sys
# from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import HiveContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

# enableHiveSupport() must be configured
spark = SparkSession.builder.master("local").appName("SparkOnHive").enableHiveSupport().getOrCreate()
# HiveContext is deprecated in Spark 2.x; spark.sql(...) can be used directly instead
hive_text = HiveContext(spark.sparkContext)

print(sys.getdefaultencoding())

hive_text.sql('use default')  # choose the database by name
data_2 = hive_text.sql("select * from word_test")  # run the query
# data_2 = hive_text.sql("select * from test_table_words")
data_3 = data_2.select("first_column_name").collect()  # select one or several columns of the table by column name
print(data_3[0][0])
print(data_2.collect()[0])
print(data_2.collect()[0][0])
print("------ Finished ------")
Save the above code as: test.py
To run it from the command line or a shell script, use:
spark-submit --files /opt/spark/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true test.py
Here:
--files: passes the Hive configuration file to the job. Note first that Hadoop, Spark, Hive, etc. must already be installed and configured; a Spark build compiled with Hive support (spark-with-hive or similar) is recommended.
--conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true: a critical setting. I am still studying the details, but it appears to make Hadoop's FileInputFormat read input directories recursively, which matters when a Hive table's data is stored in nested subdirectories (otherwise you hit the "Not a file" IOException above).
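Since the flags above are easy to mistype, here is a small pure-Python helper (hypothetical, not from the original post — the function name and paths are my own) that assembles the same spark-submit command line programmatically:

```python
# Sketch: build the spark-submit argv used in this post from Python,
# e.g. for launching via subprocess. Paths are examples; adjust to your install.
HIVE_SITE = "/opt/spark/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml"  # assumed path


def build_submit_cmd(script, files=(), conf=None):
    """Return a spark-submit argv list; `conf` is a dict of Spark settings."""
    cmd = ["spark-submit"]
    for f in files:
        cmd += ["--files", f]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


cmd = build_submit_cmd(
    "test.py",
    files=[HIVE_SITE],
    conf={"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive": "true"},
)
print(" ".join(cmd))
```

You could then pass `cmd` to `subprocess.run(cmd)` instead of typing the flags by hand each time.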
After the above command succeeded once, I found on a second run that spark-submit --files /opt/spark/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml test.py (without the --conf flag) also worked directly. This puzzled me a little; perhaps the setting passed via --conf had already taken effect through hive-site.xml. I am still looking into it.
Note: in this case you do not need to set spark.sql.warehouse.dir in the code, i.e. config("spark.sql.warehouse.dir", some_path) is unnecessary.
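If you prefer not to pass --conf on every submit, the same recursive-input setting can be placed in the session builder itself. This is a sketch only (untested here, Spark 2.x builder API); the commented line shows where spark.sql.warehouse.dir would go in the cases where you do need a custom warehouse location:

```python
# Sketch: set the recursive-input option inside the script instead of on the
# spark-submit command line. Spark still needs hive-site.xml (e.g. via --files
# or $SPARK_HOME/conf) so it can reach the Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local")
    .appName("SparkOnHive")
    # same setting as the --conf flag above
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
    # only if you DO need a custom warehouse location (not needed in this post):
    # .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)
```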