After processing data with Spark, we usually need to persist the results, for example as text files or as rows in a database, so that downstream steps such as page visualization can consume them. The sink types covered here are:
- text
- csv
- json
- jdbc
- hive
- hbase
- kafka
- elasticsearch
text / csv / json / parquet
These four formats are grouped together because they are written in almost the same way. To keep the focus on saving, the data source is a small hand-built dataset. The code:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("saveDatas")\
        .master("local[*]")\
        .getOrCreate()

    schema = ['name', 'age']
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    peopledf = spark.createDataFrame(datas, schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+

    # text supports only a single string column, so select "name" first
    peopledf.select("name").write.text("/home/llh/data/people_text")
    # Jack
    # Rose
    # Andy

    peopledf.write.csv("/home/llh/data/people_csv", sep=':')
    # Jack:27
    # Rose:24
    # Andy:32

    peopledf.write.json("/home/llh/data/people_json", mode='overwrite')
    # {"name": "Jack", "age": 27}
    # {"name": "Rose", "age": 24}
    # {"name": "Andy", "age": 32}

    peopledf.write.parquet("/home/llh/data/people_parquet", mode='append')
    # ...

    spark.stop()
```
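Note that each write call above produces a directory of part files rather than a single file, and the json writer emits the JSON Lines layout: one compact object per line. A minimal sketch of that layout using only Python's standard `json` module (no Spark needed; the rows mirror the dataset above):

```python
import json

rows = [{"name": "Jack", "age": 27},
        {"name": "Rose", "age": 24},
        {"name": "Andy", "age": 32}]

# JSON Lines: one compact JSON object per line
lines = [json.dumps(r, separators=(',', ':')) for r in rows]
text = "\n".join(lines)
print(text)

# Reading it back is just parsing line by line
parsed = [json.loads(line) for line in text.splitlines()]
```

This is why a saved JSON "file" can be loaded back with `spark.read.json` on the whole directory: every line of every part file is an independent record.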
jdbc
The JDBC target can be MySQL, Oracle, TiDB, and so on; usage is essentially the same across them, so we use MySQL as the example. When saving to a database there are four save modes:
- SaveMode.Append => "append" => append to the existing table
- SaveMode.Overwrite => "overwrite" => overwrite the existing table
- SaveMode.ErrorIfExists => "error" | "errorifexists" => raise an exception if the table exists (the default)
- SaveMode.Ignore => "ignore" => silently do nothing if the table exists
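The semantics of the four modes can be illustrated without a database at all. The sketch below mimics them against a plain local file; this is not Spark's implementation, only the behavior contract each mode promises:

```python
import os

def save(path, data, mode="error"):
    """Mimic Spark's SaveMode semantics against a plain local file."""
    exists = os.path.exists(path)
    if mode == "append":
        with open(path, "a") as f:          # add to existing contents
            f.write(data)
    elif mode == "overwrite":
        with open(path, "w") as f:          # replace existing contents
            f.write(data)
    elif mode in ("error", "errorifexists"):
        if exists:                          # default: fail if target exists
            raise FileExistsError(path)
        with open(path, "w") as f:
            f.write(data)
    elif mode == "ignore":
        if not exists:                      # do nothing if target exists
            with open(path, "w") as f:
                f.write(data)
```

For example, calling `save(p, "x", mode="ignore")` on an existing path leaves it untouched, while the default mode raises, matching what `df.write.mode(...)` does with an existing table or path.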
You will notice there is no update mode, meaning existing database rows cannot be updated in place. Implementing that requires patching the Spark source; see the article "Spark源码实现MySQL update" for a reference.
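For reference, an update-capable writer would typically issue an upsert per batch of rows. A sketch of that SQL using the stdlib `sqlite3` module as a stand-in (MySQL would use `INSERT ... ON DUPLICATE KEY UPDATE` instead; the table and column names are just the example schema from this post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Jack", 27), ("Rose", 24)])

# Upsert: insert new rows, and update the age of rows whose key exists
rows = [("Jack", 28), ("Andy", 32)]
conn.executemany(
    "INSERT INTO people VALUES (?, ?) "
    "ON CONFLICT(name) DO UPDATE SET age = excluded.age",
    rows)

print(sorted(conn.execute("SELECT name, age FROM people")))
```

This is the statement shape a patched JDBC writer would emit in place of plain INSERTs.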
The code to save the DataFrame to the database:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("saveDatas")\
        .master("local[*]")\
        .getOrCreate()

    schema = ['name', 'age']
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    peopledf = spark.createDataFrame(datas, schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+

    mysql_url = "jdbc:mysql://localhost:3306/test?user=root&password=1"
    mysql_table = "people"
    peopledf.write.mode("append").jdbc(mysql_url, mysql_table)

    spark.stop()
```
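Under the hood, append mode creates the target table if it is missing and then writes each partition's rows as batched INSERT statements. A rough sketch of that flow, again with stdlib `sqlite3` standing in for the MySQL connection (connection details are placeholders, not Spark internals):

```python
import sqlite3

datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]

conn = sqlite3.connect(":memory:")
# append mode: create the target table only if it does not exist ...
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
# ... then insert every row; Spark batches these per partition
conn.executemany("INSERT INTO people VALUES (?, ?)", datas)

print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])
```

Running the Spark job twice with append therefore doubles the row count, which is why overwrite or an upsert strategy is needed for idempotent reloads.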
hive
hbase
kafka
elasticsearch
Spark learning series index:
- Spark Learning Example 1 (Python): Word Count
- Spark Learning Example 2 (Python): Load Data Source
- Spark Learning Example 3 (Python): Save Data
- Spark Learning Example 4 (Python): RDD Transformations
- Spark Learning Example 5 (Python): RDD Actions
- Spark Learning Example 6 (Python): Shared Variables
- Spark Learning Example 7 (Python): Converting between RDD, DataFrame, and DataSet
- Spark Learning Example 8 (Python): Input Sources Streaming
- Spark Learning Example 9 (Python): Window Operations