As a workaround, you might try the following change to python/pyspark/worker.py.
Add the following two lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
... so the process function now looks like this (in Spark 1.5.2, at least):

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        for obj in iterator:
            pass
After making the change, you will need to rebuild pyspark.zip in the python/lib folder so it includes the modified worker.py.
The issue may be that the worker process completes before the executor has finished writing all the data to it. The thread writing data down the socket then throws an exception, and if this happens before the executor marks the task as complete, it causes the failure. The idea behind the workaround is to force the worker to pull all the data from the executor, even when it is not needed to lazily compute the output. This is of course inefficient, so it is a workaround rather than a proper fix.
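The lazy-consumption behaviour behind this can be shown with a minimal sketch (plain Python, not Spark code; source and func are hypothetical stand-ins for the executor's input stream and the user function). A lazily computed output may read only part of the input iterator; the draining loop then forces the rest to be pulled through:

```python
import itertools

def source():
    # Stand-in for the executor streaming rows into the worker.
    for i in range(10):
        yield i

def func(split_index, iterator):
    # Stand-in for a user function whose output needs only part of
    # the input (e.g. a take(3)-style operation): it lazily consumes
    # just 3 rows, leaving the rest of the stream unread.
    return itertools.islice(iterator, 3)

iterator = source()
out = list(func(0, iterator))   # consumes only rows 0, 1, 2
remaining = list(iterator)      # the drain: pulls the 7 unread rows

print(out)        # [0, 1, 2]
print(remaining)  # [3, 4, 5, 6, 7, 8, 9]
```

Without the drain, the unread rows would be left in the socket buffer while the worker exits, which is the situation the `for obj in iterator: pass` loop is meant to avoid.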