PySpark MySQL connection problems

Notes on submitting a PySpark job to run on a cluster

The original code is as follows:

import os
import sys

project = 'OneStopDataPlatformPY'  # root directory of the working project
path = os.getcwd().split(project)[0] + project
sys.path.append(path)
print(path)
from pyspark.sql import SparkSession
from org.atgpcm.onestop.common.conf.ConfigurationManager import ConfigurationManager

initConfig = ConfigurationManager().getInitConfig()
driver = initConfig.get('mysql','driver')
jdbcUrl244 = initConfig.get('mysql','244jdbc.url')
jdbcUrl246 = initConfig.get('mysql','246jdbc.url')
user = initConfig.get('mysql','jdbc.user')
password = initConfig.get('mysql','jdbc.password')  #properties["jdbc.password"]

conn244 = {'user':user,'password':password,'driver':'com.mysql.cj.jdbc.Driver'}
conn246 = {'user':user,'password':password,'driver':'com.mysql.cj.jdbc.Driver'}

def start():
    spark_session = SparkSession.builder \
        .master('local[8]') \
        .appName('WordOfMouthStatisticsSpark') \
        .config('spark.jars', '../lib/mysql-connector-java.jar') \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
        .config("spark.default.parallelism", "100")\
        .config("spark.locality.wait", "0")\
        .getOrCreate()
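    # Partitioned read: split label_id in [0, 2000000) across 200 partitions
    df = spark_session.read.jdbc(
        url=jdbcUrl244, table='all_auto_label', column='label_id',
        lowerBound=0, upperBound=2000000, numPartitions=200,
        properties=conn244)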
    # .filter("state = 1  and sentiment is not null and source_type = 2")
    df.show()

if __name__ == '__main__':
    start()

Modified code

(The modified code was shown in a screenshot that is not included here.)

Question one:

py4j.protocol.Py4JJavaError: An error occurred while calling o60.jdbc.
: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
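Solution: this was caused by my environment. The database is mapped to a public IP, but the cluster nodes could not reach that database IP and port, even though the same code ran fine from PyCharm on Windows. When submitting to the cluster, switching the JDBC URL to MySQL's internal (private) IP fixed the problem.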
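
Before changing the JDBC URL, it can help to confirm from a cluster node whether the MySQL host and port are reachable at all. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and its default timeout are illustrative choices, not part of the original project.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Running something like `can_connect("10.0.0.5", 3306)` on each worker (the address here is a made-up placeholder) shows whether a "Communications link failure" is a network reachability issue rather than a driver or credentials issue.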

Question two:

from org.atgpcm.onestop.common.conf.ConfigurationManager import ConfigurationManager
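This line raises an error: the custom module cannot be imported.
Solution: spell out the full package path in the import statement:
	from <full package path> import ***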
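
One common reason a custom package imports locally but not on the cluster is the layout of the archive passed to `--py-files`: imports are resolved from the root of the zip, so the top-level package directory has to sit at the archive root. Whether that was the cause here is an assumption. The sketch below shows one way to build and inspect such a zip; the helper names `build_py_files_zip` and `top_level_entries` are hypothetical.

```python
import shutil
import zipfile


def build_py_files_zip(project_root, output_base):
    """Zip the *contents* of project_root so its packages sit at the archive root.

    `output_base` is the target path without the .zip extension and should be
    outside project_root so the archive does not include itself.
    Returns the path of the created archive.
    """
    return shutil.make_archive(output_base, "zip", root_dir=project_root)


def top_level_entries(zip_path):
    """List the distinct top-level names in the archive (what `import` can see)."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

If `top_level_entries` returns the project folder name instead of the first package of the import path, the import statement needs that extra prefix, or the zip needs to be rebuilt from one level deeper.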

Question three:

    def getInitConfig(self):
        # Create the ConfigParser object
        config = configparser.ConfigParser()
        # Read the configuration file
        filename = 'config.ini'
        file = os.path.abspath(os.path.join(os.getcwd(), "..", filename))
        print(os.getcwd()+'=========================')
        print(file+'------------------------------')
        config.read(file, encoding='utf-8')
        return config
Here, file = os.path.abspath(os.path.join(os.getcwd(), "..", filename)) depends on the current working directory, so when the job runs on the cluster it does not resolve to the correct path.
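
An alternative that does not depend on the working directory is to resolve the config file relative to a known file on disk instead of `os.getcwd()`. This is a sketch, not the author's fix: the function `load_config` and its `levels_up` parameter are hypothetical, and it assumes `anchor_file` is a real filesystem path (a module loaded from a `--py-files` zip has a `__file__` inside the archive, which would not work here).

```python
import configparser
import os


def load_config(anchor_file, filename="config.ini", levels_up=1):
    """Load `filename` from a directory `levels_up` levels above anchor_file's directory."""
    base = os.path.dirname(os.path.abspath(anchor_file))
    for _ in range(levels_up):
        base = os.path.dirname(base)
    path = os.path.join(base, filename)
    # ConfigParser.read() silently skips missing files, so fail loudly instead
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    config = configparser.ConfigParser()
    config.read(path, encoding="utf-8")
    return config
```

Called as `load_config(__file__)` from the main script, this looks for `config.ini` one directory above the script regardless of where `spark-submit` is invoked. The explicit existence check matters because a missing file otherwise only surfaces later as a missing-section error.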

The following spark-submit command addresses the problems above:

/home/bigdata/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
--master spark://bigdat01:7077 \
--executor-memory 4g \
--total-executor-cores 8 \
--driver-memory 3g \
--py-files /home/bigdata/sparkJar/OneStopDataPlatformPY.zip \
/home/bigdata/sparkJar/OneStopDataPlatformPY/org/atgpcm/onestop/spark/test1.py

Important: pay attention to where the submit command is run. It must be executed from the directory that contains test1.py.

Origin blog.csdn.net/TR_0323/article/details/104479665