Pitfalls of PySpark in Practice


PySpark is the Python API for Apache Spark. Python is simple and practical, so driving Spark from Python is a good option for straightforward data analysis. Cassandra is a distributed NoSQL database system; as a rough intuition, it is similar to HBase and MongoDB. This article shares the pitfalls I hit while operating Cassandra with PySpark in a practical application.

1. spark-cassandra-connector

First, choose the connector version according to your Spark and Cassandra versions; see the compatibility table at:
[https://github.com/datastax/spark-cassandra-connector]

The connector can be added when submitting a task:

./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2

Or it can be added inside the Python script:

os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages com.datastax.spark:spark-cassandra-connector_2

2. Configure the Python environment on the Spark master and worker nodes

The first pitfall is the Python version. If the code is written in Python 3 but the master and workers default to Python 2, the job will fail with Python errors at runtime. For Java and Scala, a jar uploaded with spark-submit can be executed directly, because both the driver and the executors run on the JVM. PySpark, by contrast, is a Python API wrapped around Spark, and it relies on Py4J for the interaction between Python and Java.

Therefore, if the submitted script is written in Python 3, you need to install Python 3 on the master and worker nodes and configure its path in the Python script:

os.environ["PYSPARK_PYTHON"] = '/bin/python3'
os.environ["PYSPARK_DRIVER_PYTHON"] = '/bin/python3'

Note that these are the Python paths on the master and worker nodes, not the path on the machine from which the Spark task is submitted.
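With the paths set as above, a quick sanity check (a sketch; the app name is arbitrary) is to compare the driver's interpreter version with what the executors actually report:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()

# Interpreter version on the driver
print("driver:", sys.version)

# Interpreter version(s) actually used on the worker nodes
versions = spark.sparkContext.parallelize(range(4), 4)\
    .map(lambda _: sys.version).distinct().collect()
print("executors:", versions)

If the two disagree, the environment variables were not picked up (for example, because they were set after the SparkSession was created).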

3. Operating Cassandra with PySpark

The recommended way is to use a SparkSession object with Spark SQL, creating a Spark Dataset or Spark DataFrame. A sample follows:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName(sys.argv[0])\
    .config("spark.cassandra.connection.host", "cassandra_host_ip").getOrCreate()

# Load Cassandra data into a DataFrame
df = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(table="table_name", keyspace="keyspace_name").load()

# UDF: UdfSample stands in for your own transformation class
# (the original elides its implementation)
class UdfSample(object):
    def __init__(self, z):
        self.z = z

    def result(self):
        return str(self.z)  # placeholder logic

udf1 = udf(lambda z: UdfSample(z).result(), StringType())

# Write the DataFrame back to Cassandra
df.write.format("org.apache.spark.sql.cassandra")\
    .options(table="table_name", keyspace="keyspace_name")\
    .mode('append').save()

Real DataFrame applications use a particularly large number of UDFs and lambdas. A custom Python UDF has an obvious performance cost from serialization, deserialization, and communication IO, because every value must travel between the Python worker and the executor JVM; Spark SQL's built-in functions avoid this serialization round trip and the other communication overhead, so prefer them to custom UDFs. Note also that a Spark DataFrame and a pandas DataFrame are two different data formats; methods exist to convert each into the other, but this is rarely practical and wastes resources.
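As an illustration (a sketch; df and the string column "name" are hypothetical), the same transformation written as a Python UDF and as a built-in function:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python UDF: every value is serialized to a Python worker and back
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df1 = df.withColumn("name_upper", to_upper(F.col("name")))

# Built-in Spark SQL function: runs entirely inside the executor JVM
df2 = df.withColumn("name_upper", F.upper(F.col("name")))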

4. crontab scheduled tasks fail to run

Suppose you use spark-submit to submit a Python script to the Spark cluster, for example:

./bin/spark-submit  --master spark://ip:port  ./python1.py

Entered by hand, the same command runs without problems, but it does not run from crontab.

The root cause actually has nothing to do with crontab; it is a Python environment problem. If you print sys.path in the Python script, you will find that the paths differ between manual execution and crontab execution.
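A minimal check, added near the top of the submitted script:

import sys

# Compare this output between a manual run and a crontab run
print(sys.path)

The simple and direct fix is to specify the local Python interpreter explicitly in the submit command: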

./bin/spark-submit \
--master spark://ip:port \
--conf spark.pyspark.python=/bin/python \
--conf spark.pyspark.driver.python=/bin/python \
./python1.py
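A crontab entry can then invoke the same command. A hypothetical example (paths are illustrative; it assumes Spark is installed under /opt/spark and logs go to /tmp):

0 2 * * * cd /opt/spark && ./bin/spark-submit --master spark://ip:port --conf spark.pyspark.python=/bin/python --conf spark.pyspark.driver.python=/bin/python ./python1.py >> /tmp/python1.log 2>&1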

In summary

For simple offline tasks, PySpark allows quick submission and deployment, simple and convenient. In big-data scenarios, however, frequent communication between the JVM and the Python process causes performance loss, so use PySpark with caution there. As I hit more problems in practice, I will keep adding to this record.
