A way to solve version incompatibility when executing spark.sql

Scenario description

The code for the Hive table import/export function is shown below. The Java program is packaged together with its Spark-related dependencies into a single assembly jar, which is then submitted to the cluster with spark-submit.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SparkHiveApplication {

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        String writeSql = "";
        SparkConf sparkConf = new SparkConf();

        // the SQL statement to run is passed in as a "WriteSql=" program argument
        for (String arg : args) {
            if (arg.startsWith("WriteSql=")) {
                writeSql = arg.replaceFirst("WriteSql=", "");
            }
        }

        SparkSession spark = SparkSession
                .builder()
                .appName("write data to hive table")
                .config(sparkConf)
                .enableHiveSupport()
                .getOrCreate();

        // e.g. LOAD DATA LOCAL INPATH '/path/to/file.csv' INTO TABLE target_table PARTITION (field='x')
        spark.sql(writeSql);

        long end = System.currentTimeMillis();
        System.out.println("cost time:" + (end - start));

        // release the session before exiting
        spark.stop();
    }
}
The spark-hive dependency declared in the Maven POM:

  <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.4.8</version>
  </dependency>

In the CDH 6.3.2 cluster (hereinafter referred to as CDH), an exception occurs (shown below) when the program executes spark.sql to import CSV data from the local disk into a Hive table, while exporting table data to the local disk and importing/exporting via HDFS both work normally.

Caused by: java.lang.IllegalArgumentException: Wrong FS: file:/input/data/training/csv_test1_1301125633652294217_1690451941587.csv, expected: hdfs://nameservice1
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)

After some research it was determined that the problem was caused by an incompatible spark-hive_2.11 version. During debugging, further exceptions appeared one after another (as follows):

Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table csv_test2. Invalid method name: 'get_table_req';
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
        at java.lang.Class.getDeclaredConstructors0(Native Method)

The initial local-disk import exception was finally resolved by using spark-hive_2.11:2.4.0-cdh6.3.3.

Then, using the jar containing the spark-hive_2.11:2.4.0-cdh6.3.3 dependency, the exception below was thrown when importing and exporting on the CDP cluster (another big data cluster). Changing the dependency version to spark-hive_2.11:2.4.8 resolved it.

java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.alterTable(java.lang.String, org.apache.hadoop.hive.ql.metadata.Table, org.apache.hadoop.hive.metastore.api.EnvironmentContext)

At that point, the versions of the components involved in the import/export on the two clusters were as follows:

  cluster   spark   hive    spark-hive_2.11 in the Java jar
  CDH       3.0.x   2.1.1   2.4.0-cdh6.3.3
  CDP       3.0.x   3.1.3   2.4.8

Note: the import and export jobs run as Spark on Kubernetes, so the Spark 3.0 inside the container image is used rather than the Spark installed on the CDH or CDP cluster.

Cause analysis

When spark.sql executes, it has to do three things:

  1. Spark first creates a HiveMetaStoreClient object;
  2. the client then communicates with the Hive Metastore Server on CDH (or CDP) to obtain the table's metadata;
  3. Spark generates the SQL execution plan from that metadata and actually processes the data.

To create an object, the JVM first has to locate the class file by its fully qualified name, then (here) construct the object through reflection, and finally invoke methods on it. This is exactly where the problem lies: the package name plus class name are identical across versions, but different versions may have different method names, method parameters, and method bodies. That is what shows up as Invalid method name: 'get_table_req', as java.lang.NoSuchMethodException, and as exceptions thrown while the methods execute.
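This can be illustrated with plain reflection. The sketch below is a simplified illustration, not Spark's actual shim code; the class and method names are taken from the stack traces above, and the demo class itself is hypothetical. With a Hive jar of a different version on the classpath, the class name still resolves, but the method lookup fails:

import java.lang.reflect.Method;

// Hypothetical demo: resolve a Hive class by its fully qualified name, then look up a
// method with the signature Spark expects. If the jar the JVM loaded is a different
// version, the class is found but the lookup throws NoSuchMethodException.
public class ReflectionMismatchDemo {
    public static void main(String[] args) {
        try {
            Class<?> hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive");
            Method alterTable = hiveClass.getMethod("alterTable",
                    String.class,
                    Class.forName("org.apache.hadoop.hive.ql.metadata.Table"),
                    Class.forName("org.apache.hadoop.hive.metastore.api.EnvironmentContext"));
            System.out.println("found: " + alterTable);
        } catch (ClassNotFoundException e) {
            // analogous to the NoClassDefFoundError above: the class is missing entirely
            System.out.println("class not on classpath: " + e.getMessage());
        } catch (NoSuchMethodException e) {
            // analogous to the java.lang.NoSuchMethodException above: same class name,
            // but this version does not have the expected method signature
            System.out.println("method not found in the loaded version: " + e.getMessage());
        }
    }
}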

Changing the dependency version as described in the scenario really means finding a suitable Hive Metastore client version and letting the JVM load it first: spark-hive_2.11:2.4.0-cdh6.3.3 internally depends on hive-metastore:2.1.1-cdh6.3.3, while 2.4.8 internally depends on hive-metastore:1.2.1.spark2.
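When it is unclear which version actually won the classpath race, you can ask the JVM which jar a given class came from. A small diagnostic sketch (the class name is one of those involved above; the demo class itself is hypothetical):

// Hypothetical diagnostic: print which jar supplied a Hive metastore class at runtime,
// e.g. a vanilla hive-metastore jar versus a CDH-flavoured one.
public class WhoLoadedIt {
    public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("org.apache.hadoop.hive.metastore.HiveMetaStoreClient");
        // getCodeSource() can be null for bootstrap classes, but for a jar on the
        // application classpath it points at the jar file.
        java.security.CodeSource source = c.getProtectionDomain().getCodeSource();
        System.out.println(c.getName() + " loaded from "
                + (source == null ? "<bootstrap>" : source.getLocation()));
    }
}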

Another solution

Spark 1.4.0 and later support interacting with different versions of the Hive Metastore. As an example, Spark 3.4.1 is compatible with Hive Metastore versions 0.12.0 through 2.3.9 and 3.0.0 through 3.1.3; the compatibility of other versions can be checked in the official documentation.

[Figure: Hive Metastore compatibility table from the Spark documentation]

How do you configure Spark to interact with a different version of the Hive Metastore?

(1) Built-in. Spark ships with a built-in Hive client. If none is included in the application jar and none is specified externally, the built-in one is used by default. The built-in Hive version differs across Spark versions: Spark 3.4.1 bundles Hive 2.3.9, and Spark 3.0.3 bundles Hive 2.3.7. When running spark.sql from spark-shell, the built-in client is what gets used, since there is no application jar involved and the shell is started simply by typing spark-shell on the command line.

(2) Download on demand. Configure spark.sql.hive.metastore.version=2.1.1 and spark.sql.hive.metastore.jars=maven; when spark.sql executes, the dependencies for version 2.1.1 are downloaded from the Maven repository to a local path (here /root/.livy/jars), about 188 jars totaling roughly 200 MB. However, the download fails when the network is slow or the repository lacks some dependency, so downloading on demand is not suitable for production environments.

(3) Specify the version and dependency path.

  • Before Spark 3.1.0, configure spark.sql.hive.metastore.version=2.1.1 and spark.sql.hive.metastore.jars=/path-to-hive-jars/*. When spark.sql executes, Spark looks for the Hive dependencies under path-to-hive-jars first.
  • From Spark 3.1.0 on, configure spark.sql.hive.metastore.version=2.1.1, spark.sql.hive.metastore.jars=path, and spark.sql.hive.metastore.jars.path=path-to-hive-jars. The path-to-hive-jars location can be a path on HDFS; see the official documentation table for details. A configuration sketch follows below.
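The following is a minimal sketch of method (3) in application code, assuming Spark 3.1.0 or later and a placeholder jar path; the class name is hypothetical, and the same settings can instead be passed as --conf options to spark-submit, as in the command shown further below:

import org.apache.spark.sql.SparkSession;

// Hypothetical example: configure an external Hive Metastore client for spark.sql.
// These are static SQL configs, so they must be set before the SparkSession is created.
public class ExternalHiveClientExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark.sql with an external hive metastore client")
                // version of the Hive Metastore that the client jars belong to
                .config("spark.sql.hive.metastore.version", "2.1.1")
                // Spark 3.1.0+: load the client jars from a path ...
                .config("spark.sql.hive.metastore.jars", "path")
                // ... given here (placeholder; local or HDFS, wildcards allowed)
                .config("spark.sql.hive.metastore.jars.path", "file:///path-to-hive-jars/*.jar")
                // before Spark 3.1.0 the path would go directly into spark.sql.hive.metastore.jars:
                //   .config("spark.sql.hive.metastore.jars", "/path-to-hive-jars/*")
                .enableHiveSupport()
                .getOrCreate();

        // print the effective settings (falling back to a marker if unset)
        System.out.println("metastore version: "
                + spark.conf().get("spark.sql.hive.metastore.version", "<built-in default>"));
        System.out.println("metastore jars   : "
                + spark.conf().get("spark.sql.hive.metastore.jars", "builtin"));

        spark.sql("SHOW DATABASES").show();
        spark.stop();
    }
}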

This method can be used in a production environment.

If you adopt method (3), how do you obtain the correct dependencies in advance, so that they are compatible with Spark and can communicate with the cluster's Hive without problems?

If the Hive of the cluster you are operating against is within Spark's compatibility range, simply hand all the jars (about 200 MB) under the cluster's hive/lib directory over to Spark. (Probably not all of them are needed, but narrowing the set down would require experimentation.)

The following is the spark-submit command used for an import operation on the CDH cluster. The jars under CDH's hive/lib were extracted in advance and mounted into the container at /opt/ml/input/data/training/sparkjar/hive-jars.

# Executed inside the k8s container
/usr/local/spark/bin/spark-submit \
--conf spark.driver.bindAddress=172.16.0.44 \
--deploy-mode client \
--conf spark.sql.hive.metastore.jars=/data/training/sparkjar/hive-jars/* \
--conf spark.sql.hive.metastore.version=2.1.1 \
--properties-file /opt/spark/conf/spark.properties \
--class com.spark.SparkHiveApplication \
local:///data/training/sparkjar/hive-metastore-spark-app-jar-with-dependencies.jar \
WriteSql=TE9BRCBEQVRBIExPQ0FMIElOUEFUSCAnL29wdC9tbC9vdXRwdXQvMTc1NjQ2NDY2MDY3Mzk4NjU3LzE3NTY0NjQ2NjA2NzM5ODY1Ny9wYXJ0LTAwMDAwLWVhYjA2ZWZiLTcwNTktNGI4MS04YmRhLWE3NGE5Yzg3OTY2MS1jMDAwLmNzdicgSU5UTyBUQUJMRSBkdF90aW9uZV90ZXN0XzIwMjIwNzIyIHBhcnRpdGlvbiAocGFydF9udW09JzEnKQ==
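One detail worth noting: the WriteSql value in the command above is base64-encoded (it decodes to a LOAD DATA LOCAL INPATH ... statement), presumably to avoid shell quoting problems with the spaces and quotes in the SQL. Assuming that convention, the application would decode the argument before calling spark.sql; a minimal sketch with a hypothetical helper, not part of the original code:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: decode a base64-encoded WriteSql= argument back into
// the SQL text that is then passed to spark.sql(...).
public class WriteSqlDecoder {
    public static String decode(String base64Sql) {
        byte[] bytes = Base64.getDecoder().decode(base64Sql);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}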

Combined with a concrete project, you can certainly collect all the required jars and find a packaging approach that suits it; what is listed here is just one way of adding dependencies to a Spark job.

Trying to "slim down" the jar

When building the assembly jar, set the scope of the spark-hive_2.11 dependency to provided (<scope>provided</scope>), so that it is not included in the final jar. The cluster manager can supply the required jars itself when the job runs, and the spark-hive artifact on the Maven website is in fact listed with the provided scope.

Without the spark-hive dependency, the jar shrinks to 9 MB (from 144 MB). The import and export operations were then run on CDP and CDH. Results:

  • The CDP cluster tests passed.

  • The CDH cluster threw an exception. The guess is that vanilla Spark 3 is incompatible with hive-metastore:2.1.1-cdh6.3.3 (distributions sometimes modify the upstream code). After applying the configuration from method (3), the import and export functions worked normally.

If a cluster is deployed from a vendor distribution, its components are more likely to be compatible with one another. Moreover, when the Java jar has to be rebuilt and re-uploaded frequently during debugging, the 9 MB size shortens upload time and improves efficiency.

Summary

The Hive Metastore client used by Spark can be specified through configuration, and preferring the dependencies that the cluster already provides reduces component-incompatibility exceptions to some extent. The Java jar then only has to care about how the application itself is written, with dependencies supplied by the cluster, which removes the tight coupling between the jar and a particular big data cluster. External configuration is only one solution, though; to use it in a real project, the implementation still needs to be designed and tested against the actual scenario's requirements.

Origin blog.csdn.net/yy_diego/article/details/132356563