1. Write the PySpark script
Steps:

- Read a local CSV file into a DataFrame
- Register the DataFrame as a Spark SQL temporary view
- Query it with spark.sql(), which returns a DataFrame, or work with the DataFrame API directly
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app_name').getOrCreate()

# Read a tab-separated local file; header=True uses the first row as column names
swimmersCSV = spark.read.csv("file:///home/douyonghou/continuity0916.csv",
                             sep='\t', header=True)
swimmersCSV.createOrReplaceTempView("swimmersCSV")
swimmersCSV.show()

# Note: show() only prints the rows and returns None, so keep the
# DataFrame itself if you need the query result afterwards
data = spark.sql("select index from swimmersCSV")
data.show()
```
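The read above assumes a tab-separated file whose first row holds the column names (that is what `sep='\t'` and `header=True` express). A minimal sketch of that layout, using only the standard library — the column names and values here are made up for illustration and are not from the real file:

```python
import csv
import io

# Hypothetical contents of a tab-separated file with a header row,
# mirroring the sep='\t', header=True options passed to spark.read.csv
raw = "index\tname\tvalue\n1\talice\t10\n2\tbob\t20\n"

# csv.DictReader with delimiter='\t' maps the header row to field names,
# much like Spark maps it to DataFrame column names
rows = list(csv.DictReader(io.StringIO(raw), delimiter='\t'))
print(rows[0]["index"])  # -> 1
```

If the real file used a different delimiter or had no header row, the `sep` and `header` arguments would need to change accordingly.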
2. Submit the Spark application from the Spark client
```shell
spark-submit --conf "spark.pyspark.driver.python=/usr/bin/python3.5" --conf "spark.pyspark.python=/usr/bin/python3.5" pysparkTest.py
```
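The two `--conf` options pin the Python interpreter used by the driver and the executors. As an alternative sketch (not runnable without a Spark installation), the same interpreters can be selected through the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_PYTHON` environment variables, which Spark also reads:

```shell
# Equivalent to the --conf flags above: choose the interpreter via
# environment variables before submitting
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5
export PYSPARK_PYTHON=/usr/bin/python3.5
spark-submit pysparkTest.py
```

The explicit `--conf` form has the advantage of keeping the interpreter choice visible in the submit command itself.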