PySpark reports an error when using GraphFrames

Running GraphFrames from PySpark fails with: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
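For context, here is a minimal sketch of the kind of code that triggers this failure; the toy vertex and edge data and the app name are illustrative, not from the original post:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # the Python wrapper imports fine; the JVM class is what is missing

    spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

    v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

    # Raises a Py4JJavaError wrapping
    # java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
    # because the GraphFrames jar is not on Spark's classpath.
    g = GraphFrame(v, e)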

Reason

The GraphFrames jar package is missing from Spark's classpath, so the JVM cannot load the org.graphframes.GraphFramePythonAPI class.

Solution

1. Download the relevant jar package
2. Configure the environment

Specific steps

  1. Download the dependent jar package : go to https://spark-packages.org/, select the Graph category, then select graphframes, and download the jar package that matches your Spark version;
  2. Add the jar package to the local Spark jars folder : with pyspark installed via pip, the path is /usr/lib/python2.7/site-packages/pyspark/jars; some installations use /usr/local/spark/jars instead, so check which path applies to your setup and move the downloaded jar package there;
  3. Configure the environment variable : since I use the PyCharm integrated development environment, this can be configured directly in the IDE. The specific steps are as follows:
    open the run-configuration drop-down and select Edit Configurations,
    then edit Environment variables and add the variable described in the next step.
  4. Specific configuration : Name: PYSPARK_SUBMIT_ARGS
    Value: --packages graphframes:graphframes:<graphframes version>-spark<spark version>-s_<scala version> pyspark-shell
    The version numbers must match the jar package you downloaded. If you are unsure of your Spark or Scala version, run spark-shell in a terminal to check.
  5. If you are not using an integrated development environment, you can set the variable directly in the program instead, before the SparkSession is created (a full verification sketch follows these steps):
    import os
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
    )
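Once the jar package or packages argument is in place, a quick end-to-end check can confirm the fix. This is a sketch under the same assumptions as above: the toy graph and app name are illustrative, and the package coordinate must match your own GraphFrames, Spark, and Scala versions.

    import os
    # Must be set before the SparkSession is created.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
    )

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("graphframes-check").getOrCreate()

    v = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    e = spark.createDataFrame([("a", "b", "friend"), ("b", "c", "follow")], ["src", "dst", "relationship"])

    g = GraphFrame(v, e)
    g.inDegrees.show()  # succeeds only if org.graphframes.GraphFramePythonAPI is found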


Source: blog.csdn.net/qq_15098623/article/details/91533349