Running PySpark code with spark-submit

On Linux, there are two ways to execute PySpark code that were found to work:

1. Install PyCharm or Spyder, write the code there, and execute it directly.

2. Submit it as a job, i.e. via spark-submit. This post focuses mainly on this approach.

First, assume the *.py file you wrote imports the following packages:

import os
import time
import ast
from collections import Counter
from operator import itemgetter

import jieba  # third-party Chinese word segmentation library

from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructField, StructType, StringType
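
For concreteness, here is a minimal sketch of what a script like cut_words_fre.py might do with these imports: segment each line with jieba and count word frequencies. The input path and the top-20 printout are assumptions for illustration, not the original author's code.

from pyspark import SparkContext
import jieba

sc = SparkContext('local', 'cut_words_fre')

# Hypothetical input file; replace with your own corpus.
lines = sc.textFile('/home/data/corpus.txt')

# Segment each line with jieba, then count word frequencies map/reduce-style.
word_counts = (lines
               .flatMap(lambda line: jieba.lcut(line.strip()))
               .filter(lambda w: w.strip())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .sortBy(lambda kv: kv[1], ascending=False))

for word, freq in word_counts.take(20):
    print(word, freq)

sc.stop()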

3. When submitting with spark-submit, all the required packages must be bundled into a single zip file. Note that the packages must first be copied into the same directory, and that directory is then zipped as a whole. For example, when many packages are needed:

First, create a folder to hold all the packages:

mkdir lib_words

Second, copy the required packages into this folder. They usually live under the Python installation's lib directory, with third-party libraries in site-packages. Picking them out one by one is too much trouble, so we copy everything and package it together; do not, however, copy or package the pyspark library itself:

cp -r /usr/local/python3.7/lib/python3.7/* /home/lib_words
cp -r /usr/local/python3.7/lib/python3.7/site-packages/* /home/lib_words

Third, zip everything into an archive:

cd /home/lib_words
zip -r /home/lib_words.zip ./*
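
Since the packages must sit at the top level of the archive (which is why it is zipped from inside the folder), a quick sanity check can help. A small sketch using the standard zipfile module; this check is an addition for illustration, not part of the original post:

import zipfile

# List the first few entries; package directories such as jieba/ should
# appear at the top level, not nested under lib_words/.
with zipfile.ZipFile('/home/lib_words.zip') as zf:
    for name in zf.namelist()[:10]:
        print(name)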

4. On the command line, submit the main *.py file with spark-submit, passing the zip archive via the '--py-files' parameter (note that spark-submit options must come before the application file), then run it:

spark-submit --py-files /home/lib_words.zip /home/pycharm_projects/cut_words/cut_words_fre.py
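
To confirm that the shipped zip is actually importable on the workers, here is a hedged sketch you could drop into the submitted script; it forces each worker to import jieba and report where the module was loaded from. This check is an assumption for illustration, not from the original post.

from pyspark import SparkContext

sc = SparkContext('local', 'check_py_files')

def jieba_location(_):
    import jieba  # resolved from the zip shipped via --py-files
    return jieba.__file__

# If the zip is missing on the workers, this raises ImportError
# instead of printing the module's location.
print(sc.parallelize(range(2), 2).map(jieba_location).collect())
sc.stop()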

 

Additionally: you can instead set the pyFiles parameter of SparkContext directly in the program, then submit simply with spark-submit /home/pycharm_projects/cut_words/cut_words_fre.py. This was also found to work:

pyFiles = ["/home/lib_words.zip"]  # path of the zip archive; tested, works
# pyFiles = ["/home/test1.py", "/home/test2.py"]  # passing individual .py files is said to work too, but not tested here (too many files)
sc = SparkContext('local', 'test', pyFiles=pyFiles)
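
A related option (a standard PySpark API, not mentioned in the original post) is SparkContext.addPyFile, which attaches the archive after the context has already been created; the file is shipped to every node and placed on the workers' sys.path just like the pyFiles argument:

from pyspark import SparkContext

sc = SparkContext('local', 'test')
# Same effect as the pyFiles constructor argument above.
sc.addPyFile('/home/lib_words.zip')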

Finally, when the job completes, a line like this appears in the log:

19:55:06 INFO spark.SparkContext: Successfully stopped SparkContext

 

Note: if the script only needs the pyspark package itself, you may not need to supply the *.zip file at all (not tested).

References:

https://blog.csdn.net/lmb09122508/article/details/84586947

https://blog.csdn.net/MrLevo520/article/details/86738109

https://blog.csdn.net/qq_23860475/article/details/90479702
