PySpark errors: java.io.IOException: Not a file and pyspark.sql.utils.AnalysisException: 'Table or view not found'

While recently using PySpark to read data directly from Hive, I often ran into several problems:

1. java.io.IOException: Not a file - the file does exist, but the default HDFS path is resolved incorrectly; you need to pass --files and --conf when submitting the job.

2. pyspark.sql.utils.AnalysisException: 'Table or view not found' - the table does exist in Hive, yet Spark reports that it cannot be found (a quick way to check what Spark can actually see is sketched below this list).

3. org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'xxxxx' not found - the database itself is reported as missing, because enableHiveSupport() was not configured. For example, set: spark = SparkSession.builder.master("local").appName("SparkOnHive").enableHiveSupport().getOrCreate()
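
When the second or third error appears, a quick sanity check is to list what the Spark session can actually see. This is only a minimal sketch: it assumes a session built with enableHiveSupport() as above, and it uses the default database and the word_test table from the test code further down.

# Diagnostic sketch: list the databases and tables visible to this session
print(spark.sql("show databases").collect())
print(spark.sql("show tables in default").collect())
# the same information is available through the catalog API
print([t.name for t in spark.catalog.listTables("default")])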

The reason for these problems:

From my testing here, the cause is that the Hive configuration file and the necessary parameters were not set when the job was submitted with spark-submit.

# The following code passed my tests
import os
import sys
# from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import HiveContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

# enableHiveSupport() must be configured
spark = SparkSession.builder.master("local").appName("SparkOnHive").enableHiveSupport().getOrCreate()
hive_text = HiveContext(spark)
print(sys.getdefaultencoding())

hive_text.sql("use default")                            # switch to the target database
data_2 = hive_text.sql("SELECT * from word_test")       # run the query
# data_2 = hive_text.sql("SELECT * from test_table_words")
data_3 = data_2.select("first_column_name").collect()   # select one column (or several) of the table by name

print(data_3[0][0])

print(data_2.collect()[0])
print(data_2.collect()[0][0])
print("------ Finished ------")

Save the above code as test.py.

To run it from the command line or a shell script, use:

spark-submit --files /opt/spark/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true test.py

Where:

--files: adds the Hive configuration file. Note first that Hadoop, Spark, Hive and so on must already be installed and configured; a Spark build compiled with Hive support (spark_with_hive or similar) is recommended.

--conf: a critical setting; I am still learning exactly how it works (a programmatic alternative is sketched below).
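
As an alternative to the command-line --conf option, the same Hadoop setting can apparently also be supplied when the session is built. The sketch below assumes the spark.hadoop. prefix is forwarded to the Hadoop configuration, just as in the --conf form; the rest of the builder simply mirrors the test code above.

# Sketch: set the recursive-input option in code instead of via --conf
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("SparkOnHive") \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .enableHiveSupport() \
    .getOrCreate()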

 

After the method above succeeded, I found on a second run that spark-submit --files /opt/spark/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml test.py could be executed directly, without the --conf option. I am a little puzzled by this; perhaps the setting had already been picked up via hive-site.xml. I am still looking into it.

Note: in this case you do not need to set spark.sql.warehouse.dir in the code, i.e. config("spark.sql.warehouse.dir", some_path) is not required (a reference sketch of that setting follows below).
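
For reference only, if that setting were needed it would be added on the builder as shown below; /user/hive/warehouse is just an example path, and the warehouse location on your cluster may differ.

# Not required in this setup; shown only as a reference sketch (the path is an example)
spark = SparkSession.builder \
    .master("local") \
    .appName("SparkOnHive") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()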


Source: www.cnblogs.com/qi-yuan-008/p/12057350.html