As a workaround, you might try the following change to python/pyspark/worker.py.
Add the following two lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
... so the process function now looks like this (in Spark 1.5.2, at least):

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        for obj in iterator:
            pass
After making the change, you will need to rebuild pyspark.zip in the python/lib folder so it includes the modified worker.py.
The issue may be that the worker process completes before the executor has finished writing all the data to it. The thread writing data down the socket then throws an exception, and if this happens before the executor marks the task as complete, it causes the failure. The idea behind the workaround is to force the worker to pull all the data from the executor, even when it is not needed to lazily compute the output. This is of course inefficient, so it is a workaround rather than a proper fix.
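The lazy-consumption behaviour behind this can be shown with a minimal sketch (plain Python, not Spark code; source and func are hypothetical stand-ins for the executor's input stream and the user function). A lazily computed output may read only part of the input iterator; the draining loop then forces the rest to be pulled through:

```python
import itertools

def source():
    # Stand-in for the executor streaming rows into the worker.
    for i in range(10):
        yield i

def func(split_index, iterator):
    # Stand-in for a user function whose output needs only part of
    # the input (e.g. a take(3)-style operation): it lazily consumes
    # just 3 rows, leaving the rest of the stream unread.
    return itertools.islice(iterator, 3)

iterator = source()
out = list(func(0, iterator))   # consumes only rows 0, 1, 2
remaining = list(iterator)      # the drain: pulls the 7 unread rows

print(out)        # [0, 1, 2]
print(remaining)  # [3, 4, 5, 6, 7, 8, 9]
```

Without the drain, the unread rows would be left in the socket buffer while the worker exits, which is the situation the `for obj in iterator: pass` loop is meant to avoid.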