spark-submit Python eggs: solving the third-party dependency problem

Suppose the Spark job uses the third-party package purl (https://github.com/ultrabluewolf/p.url), which in turn depends on the third-party packages future and six (six already ships with Anaconda2).

The PySpark code is as follows:

 

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My test App")
sc = SparkContext(conf=conf)

# Note: purl is imported inside get_purl rather than at the top level,
# so the import runs on the executors after the eggs have been shipped.
#from purl import Purl

def get_purl(x):
    from purl import Purl
    url = Purl('https://github.com/search?q={}'.format(x))
    return str(url.add_query('name', 'dog'))

int_rdd = sc.parallelize([1, 2, 3, 4])
r = int_rdd.map(lambda x: get_purl(x))
print(r.collect())
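
Alternatively, the eggs can be shipped from inside the script with SparkContext.addPyFile instead of the --py-files flag; a minimal sketch, assuming the egg files (built below) sit next to main_dep.py:

# Equivalent of --py-files from inside the script; call right after the
# SparkContext is created, before any RDD work. Paths are illustrative.
sc.addPyFile('p.url-0.1.0a4-py2.7.egg')
sc.addPyFile('future-0.17.1-py2.7.egg')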

Here's how to build the egg for the package.

Download the source code from https://pypi.org/project/p.url/#files and extract it. Then, in the extracted directory, run:

python setup.py bdist_egg

The generated egg file appears in the dist directory.

Similarly, download the source code for future from https://pypi.org/project/future/#files, extract it, and build its egg the same way.
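
Before shipping the eggs to the cluster, a quick local sanity check that they import cleanly can save a round trip; eggs are zip-importable, so putting them on sys.path is enough (a sketch with illustrative file names):

import sys
# egg files can be imported straight off sys.path, no installation needed
sys.path.insert(0, 'p.url-0.1.0a4-py2.7.egg')
sys.path.insert(0, 'future-0.17.1-py2.7.egg')
from purl import Purl
print(Purl('https://github.com/search?q=1').add_query('name', 'dog'))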

 

Finally, run:

spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py

The output:

['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']

 

 

As a supplement, the official documentation (quoted below) describes the pain points of eggs, but it doesn't spell out the concrete steps shown above:

Complex Dependencies

Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:

def import_pandas(x):
    import pandas   # imported on the executors, not the driver
    return x

int_rdd = sc.parallelize([1, 2, 3, 4])
# map() is lazy, so collect() is chained onto the mapped RDD to force
# the import to actually run on the executors
int_rdd.map(lambda x: import_pandas(x)).collect()

pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.

Limitations of Distributing Egg Files

In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
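
For example, if Anaconda2 is installed at the same path on every host, the worker Python can be selected via Spark configuration; a minimal sketch (the install path is an assumption, and on Spark 2.1+ the spark.pyspark.python property plays the same role as the older PYSPARK_PYTHON environment variable):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("My test App")
        # point executors at a Python that already has pandas/NumPy/SciPy
        # installed; /opt/anaconda2 is an assumed, illustrative path
        .set("spark.pyspark.python", "/opt/anaconda2/bin/python"))
sc = SparkContext(conf=conf)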

 

 


Source: www.cnblogs.com/bonelee/p/11125481.html