Interaction Between Spark SQL and Hive

Spark SQL can read and write Hive tables

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Ref: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#hive-tables
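
A minimal Scala sketch of the read/write path described above; it assumes the spark-hive module is on the classpath and that a Hive table named src with columns key and value already exists (all names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hive support must be enabled explicitly; Spark then loads whatever
// Hive dependencies it finds on the classpath.
val spark = SparkSession.builder()
  .appName("SparkSQLHiveExample")
  .enableHiveSupport()
  .getOrCreate()

// Read an existing Hive table with plain SQL.
spark.sql("SELECT key, value FROM src").show()

// Write a DataFrame back to Hive as a managed table.
spark.range(10).toDF("id").write.saveAsTable("spark_ids")
```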

Spark SQL supports most Hive functions and features

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Hive SerDes and UDFs are based on Hive 1.2.1, and Spark SQL can be connected to different versions of Hive Metastore.
Ref: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#compatibility-with-apache-hive
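
As a hedged sketch of connecting to a non-default metastore version: the configuration keys below (spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars) are real Spark settings, while the version string itself is illustrative and must match your deployment:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CustomMetastoreVersion")
  // Version of the Hive Metastore that Spark SQL should talk to
  // (the built-in client is 1.2.1; the value below is illustrative).
  .config("spark.sql.hive.metastore.version", "2.1.1")
  // "maven" downloads matching Hive client jars at runtime; alternatives
  // are "builtin" or an explicit classpath of jars.
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```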

Hive features not supported by Spark SQL

  • Major Hive Features:
    Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn’t support buckets yet (Spark’s own, Hive-incompatible bucketing API is sketched after this list).
  • Esoteric Hive Features:
    UNION type
    Unique join
    Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment and only supports populating the sizeInBytes field of the hive metastore.
  • Hive Input/Output Formats:
    File format for CLI: For results showing back to the CLI, Spark SQL only supports TextOutputFormat.
    Hadoop archive
  • Hive Optimizations:
    A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Ref: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#unsupported-hive-functionality
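
Although Hive-style bucketed tables are unsupported, Spark SQL ships its own (Hive-incompatible) bucketing on DataFrameWriter; a sketch with hypothetical table and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Assumes a source table "src" with columns "key" and "value" (hypothetical).
val df = spark.table("src")

df.write
  .bucketBy(8, "key")  // hash rows into 8 buckets by "key" (Spark's scheme, not Hive's)
  .sortBy("value")     // sort rows within each bucket
  .saveAsTable("src_bucketed") // bucketBy only works with saveAsTable
```

Note that tables written this way use Spark’s hash function, so Hive itself will not recognize them as bucketed.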

Reprinted from blog.csdn.net/sinat_34763749/article/details/81282376