Using third-party packages with an offline Spark cluster

A Spark cluster sometimes needs third-party packages such as graphframes or the Kafka connectors (graphframes is used as the example below). According to the official documentation, the --packages command-line option usually solves this:

$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.6.0-spark2.2-s_2.11

This command downloads graphframes and its dependency jars from the network and saves them to $HOME/.ivy2/jars. But how do you handle this when the Spark cluster is offline or the network is poor?
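
For orientation, you can list the downloaded files directly in the Ivy cache; they are named org_artifact-version.jar, which is where the graphframes file name used in the steps below comes from (the dependency jars next to it vary with the version):

ls $HOME/.ivy2/jars/
# graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar, plus its dependency jars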

  1. On a host with Internet access, download the package.
    Run the --packages command-line option there as above; the jars land in $HOME/.ivy2/jars and then need to be copied to the offline cluster (see the copy sketch after this list).

  2. If you use pyspark, you also need to extract the relevant Python package.
    For graphframes, unpack graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar, zip the graphframes folder inside it, and add that zip to the PYTHONPATH environment variable:

unzip graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar
zip -r graphframes.zip graphframes
  3. Replace the --packages option with the --jars option, listing the jars you just downloaded.
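
Since the cluster cannot reach the network, the jars from step 1 and the graphframes.zip from step 2 have to be copied over by hand. A minimal sketch, assuming scp access; user@offline-host and /path/to/ are placeholders:

# run on the host with Internet access
scp $HOME/.ivy2/jars/*.jar user@offline-host:/path/to/
scp graphframes.zip user@offline-host:/path/to/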

The command line on the offline cluster then becomes:

export PYTHONPATH=$PYTHONPATH:/path/to/graphframes.zip
$SPARK_HOME/bin/spark-shell --jars /path/to/graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar,/path/to/xxx.jar
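
If the job is launched with pyspark rather than spark-shell, the --py-files option can be used instead of (or in addition to) exporting PYTHONPATH; it also ships graphframes.zip to the executors. A sketch with the same placeholder paths:

$SPARK_HOME/bin/pyspark --jars /path/to/graphframes_graphframes-0.6.0-spark2.2-s_2.11.jar --py-files /path/to/graphframes.zip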
