02-Packaging code and dependencies

Packaging code and dependency instructions

During development, the applications we write usually rely on third-party libraries (that is, the program introduces dependencies that are neither in the org.apache.spark package nor in the language's runtime library), and we need to ensure that all of these dependencies can be found when the Spark application runs.

  • For Python, there are several ways to make third-party libraries available:
    • Install the dependent libraries on every machine in the cluster with a package manager (such as pip), or manually copy them into the site-packages/ directory of the Python installation.
    • Submit standalone library files with the --py-files flag of spark-submit (see the example after this list).
    • If we do not have permission to install packages on the cluster, we can add the dependencies manually, but we must avoid conflicts with packages that are already installed on the cluster.
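
For example, a dependency archive and an extra module can be shipped alongside the main script. A minimal sketch; deps.zip, utils.py, and my_app.py are placeholder names (--py-files accepts .py, .zip, and .egg files):

spark-submit \
    --master yarn \
    --py-files deps.zip,utils.py \
    my_app.py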

Notice:

When submitting an application, never include Spark itself among the submitted dependencies; spark-submit automatically makes sure that Spark is on your program's classpath.
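
For Maven-built Java/Scala applications, this usually means declaring the Spark dependencies with provided scope, so they are available at compile time but are not bundled into the submitted jar. A minimal sketch, using the same property names as the reference POMs below:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <!-- provided: needed to compile, supplied by the cluster at run time -->
    <scope>provided</scope>
</dependency>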

  • For Java and Scala, you can submit standalone jar dependencies with the --jars flag of spark-submit (see the example after this list).
    • This approach works well when you depend on only one or two simple libraries that have no transitive dependencies of their own.
    • It becomes clumsy when you depend on many libraries.
      • The common practice in that case is to use a build tool (such as Maven or sbt) to produce a single large jar that contains the application together with all of its transitive dependencies (an "uber" or "fat" jar).
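
A minimal sketch of submitting an application together with a couple of extra jars; the class, jar, and file names are placeholders:

spark-submit \
    --master yarn \
    --class com.example.MyApp \
    --jars libs/dep-a.jar,libs/dep-b.jar \
    my-app.jar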

Using Maven to build a Spark application written in Java

Reference POM:

<repositories>
    <!-- Repository locations, in order: aliyun, cloudera, jboss -->
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>https://repository.jboss.com/nexus/content/groups/public/</url>
    </repository>
</repositories>


<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>

    <scala.version>2.12.15</scala.version>
    <scala.binary.version>2.12</scala.binary.version>

    <hadoop.version>3.1.3</hadoop.version>
    
    <spark.version>3.2.0</spark.version>
    <spark.scope>compile</spark.scope><!-- switch to "provided" when the cluster supplies Spark -->
</properties>


<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>${spark.scope}</scope><!-- compile by default; see the spark.scope property above -->
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>


<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.10.1</version>
            <configuration>
                <source>${maven.compiler.source}</source><!-- JDK version of the source code -->
                <target>${maven.compiler.target}</target><!-- target JVM version of the generated class files -->
                <encoding>${project.build.sourceEncoding}</encoding><!-- source file encoding -->
            </configuration>
        </plugin>
    </plugins>
</build>
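
With this POM, a typical workflow is to build the jar with Maven and hand it to spark-submit; the artifact and class names below are placeholders:

mvn clean package
spark-submit \
    --master yarn \
    --class com.example.JavaWordCount \
    target/my-spark-app-1.0.jar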

Using Maven to build a Spark application written in Scala

Reference POM:

<repositories>
    <!-- Repository locations, in order: aliyun, cloudera, jboss -->
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>https://repository.jboss.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.13.5</scala.version>
    <scala.binary.version>2.13</scala.binary.version>
    <spark.version>3.2.0</spark.version>
    <hadoop.version>3.1.3</hadoop.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <!-- Maven assembly (fat-jar packaging) plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <!-- produce an extra jar that bundles all dependencies -->
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- This plugin compiles Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <!-- bind to Maven's compile and test-compile phases -->
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
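
With this POM, mvn package produces an additional *-jar-with-dependencies.jar that bundles the application and its dependencies and can be submitted directly; the artifact and class names below are placeholders:

mvn clean package
spark-submit \
    --master yarn \
    --class com.example.ScalaWordCount \
    target/my-spark-app-1.0-jar-with-dependencies.jar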

Using sbt to build a Spark application written in Scala

Not currently used here; not yet documented.

Dependency conflicts

When our Spark application and Spark itself depend on the same library (but on different versions of it), dependency conflicts may occur and cause the program to crash.

Dependency conflicts usually appear as:

  • NoSuchMethodError
  • ClassNotFoundException
  • other JVM exceptions related to class loading

There are two main ways to solve this type of problem:

1) Modify the Spark application so that it depends on the same library versions that Spark itself uses.

2) Package the Spark application with "shading", relocating the conflicting packages so that your version and Spark's version can coexist (see the sketch below).
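
A minimal sketch of shading with the maven-shade-plugin; com.google.common (Guava) is used here only as an illustration of a conflicting package, and the relocated package name is a placeholder:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <!-- rewrite the conflicting package into a private namespace inside the fat jar -->
                    <relocation>
                        <pattern>com.google.common</pattern>
                        <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>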
