PySpark reports an error when running GraphFrames: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
Reason
The GraphFrames jar package that PySpark depends on is missing.
Solution
1. Download the relevant jar package
2. Configure the environment
Specific steps
- Download the dependent jar package : Go to the download site https://spark-packages.org/, select Graph, then select graphframes, and download the jar package that matches your Spark version;
- Add the jar package to the local Spark's jar dependency folder : I use pyspark installed via pip, so the jar path is /usr/lib/python2.7/site-packages/pyspark/jars; some people online place it directly under /usr/local/spark/jars instead, so check which path applies to your installation. Once you have determined the path, move the downloaded jar package there (a short sketch for locating the folder programmatically follows at the end of this section);
- Configure environment variables : I use the PyCharm integrated development environment, so I can configure environment variables directly in the IDE. The specific steps are as follows:
1. Click the run-configuration drop-down and select Edit Configurations;
2. open the Environment variables field;
3. add a new environment variable.
- Specific configuration :
Name: PYSPARK_SUBMIT_ARGS
Value: --packages graphframes:graphframes:<your graphframes version>-spark<your Spark version>-s_<your Scala version> pyspark-shell, where the version numbers are those of your downloaded jar package. If you are not sure of your exact Spark and Scala version numbers, run spark-shell in a terminal to check them.
- If you are not using an integrated development environment, you can set the variable directly in the program (a fuller runnable sketch follows at the end of this section):
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)
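
As mentioned in the second step above, the jars folder of a pip-installed pyspark can be located programmatically rather than guessed. A minimal sketch, assuming only that pyspark is importable; the printed path will vary with your Python version and environment:

import os
import pyspark

# The bundled jar dependencies live in the "jars" folder next to the
# pyspark package itself; copy the GraphFrames jar into this directory.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)  # e.g. /usr/lib/python2.7/site-packages/pyspark/jars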
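
To check that the configuration works end to end, the sketch below sets PYSPARK_SUBMIT_ARGS before the SparkSession is created (it must be set before the JVM starts) and builds a tiny GraphFrame. This is a minimal sketch: the vertex and edge data are made-up placeholders, the app name is arbitrary, and the version string must match your downloaded jar.

import os

# Must be set before the SparkSession (and hence the JVM) is started.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-check").getOrCreate()

# Toy graph: two vertices and one edge (placeholder data).
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.edges.show()  # raises ClassNotFoundException if the jar is still missing

If this runs without the ClassNotFoundException, the jar package and environment variable are configured correctly.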