Spark Learning Examples (Python): Save Data

After processing data with Spark, we usually need to persist the results, for example by saving them as text files or inserting them into a database, so that downstream steps such as page visualization can consume them. The save targets covered here are:

  • text
  • csv
  • json
  • parquet
  • jdbc
  • hive
  • hbase
  • kafka
  • elasticsearch

text / csv / json / parquet

These four formats are grouped together because the save API is essentially the same for all of them. To keep the focus on how saving works, the data source is a small hand-made dataset.

The following code implements this:

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("saveDatas")\
        .master("local[*]")\
        .getOrCreate()
    schema = ['name', 'age']
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    peopledf = spark.createDataFrame(datas, schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    # text only supports a single string column, so select "name" first
    peopledf.select("name").write.text("/home/llh/data/people_text")
    # Jack
    # Rose
    # Andy
    peopledf.write.csv("/home/llh/data/people_csv", sep=':')
    # Jack:27
    # Rose:24
    # Andy:32
    # mode='overwrite' replaces any existing output at the path
    peopledf.write.json("/home/llh/data/people_json", mode='overwrite')
    # {"name": "Jack", "age": 27}
    # {"name": "Rose", "age": 24}
    # {"name": "Andy", "age": 32}
    # mode='append' adds new files next to any existing output
    peopledf.write.parquet("/home/llh/data/people_parquet", mode='append')
    # ...
    spark.stop()
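
Each write call above produces a directory of part files rather than a single file. To sanity-check the output, every format can be read back with the matching reader; a minimal sketch reusing the paths above (run in a live session, i.e. before spark.stop()):

text_df = spark.read.text("/home/llh/data/people_text")           # one string column named "value"
csv_df = spark.read.csv("/home/llh/data/people_csv", sep=':')     # sep must match the one used on write
json_df = spark.read.json("/home/llh/data/people_json")           # schema inferred from the records
parquet_df = spark.read.parquet("/home/llh/data/people_parquet")  # schema stored in the files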

jdbc

The JDBC target database can be MySQL, Oracle, TiDB, and so on; usage is essentially the same across them, so we use MySQL as the example here. When saving data to a database there are four save modes:

SaveMode.Append => "append" => append the new rows to existing data

SaveMode.Overwrite => "overwrite" => overwrite any existing data

SaveMode.ErrorIfExists => "error" | "errorifexists" => raise an error if data already exists (the default)

SaveMode.Ignore => "ignore" => silently skip the write if data already exists
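
In code, the mode is passed either as a keyword argument on the format method or through the .mode() builder; the two styles are interchangeable. A quick sketch reusing peopledf from the example above:

peopledf.write.csv("/home/llh/data/people_csv", mode='ignore')     # keyword form: skip if the path exists
peopledf.write.mode('overwrite').csv("/home/llh/data/people_csv")  # builder form: replace whatever is there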

Note that there is no update mode, meaning existing rows in the database cannot be updated. Implementing that requires modifying the Spark source; see the article "Spark源码实现MySQL update" for reference.
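
A common workaround that avoids patching Spark is to upsert from each partition with a plain MySQL client. A minimal sketch, assuming the pymysql package, a unique key on the name column, and the peopledf DataFrame and connection settings from the code below (all of these are assumptions, not part of the original article):

import pymysql

def upsert_partition(rows):
    # One connection per partition, not per row
    conn = pymysql.connect(host='localhost', user='root', password='1', database='test')
    with conn.cursor() as cur:
        for row in rows:
            # Insert new people, update age for names that already exist
            cur.execute(
                "INSERT INTO people (name, age) VALUES (%s, %s) "
                "ON DUPLICATE KEY UPDATE age = VALUES(age)",
                (row['name'], row['age']))
    conn.commit()
    conn.close()

peopledf.rdd.foreachPartition(upsert_partition)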

The following code saves the data to the database:

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("saveDatas")\
        .master("local[*]")\
        .getOrCreate()
    schema = ['name', 'age']
    datas = [('Jack', 27), ('Rose', 24), ('Andy', 32)]
    peopledf = spark.createDataFrame(datas, schema)
    peopledf.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    # Assumes the MySQL JDBC driver JAR is on the classpath
    mysql_url = "jdbc:mysql://localhost:3306/test?user=root&password=1"
    mysql_table = "people"
    peopledf.write.mode("append").jdbc(mysql_url, mysql_table)
    spark.stop()
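
Spark does not bundle the MySQL JDBC driver, so the connector JAR has to be supplied at submit time, for example with --packages (the version coordinate below is illustrative):

spark-submit --packages mysql:mysql-connector-java:5.1.47 save_datas.py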

hive
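
The original article leaves this section as a placeholder. For reference, a Hive-enabled session can save a DataFrame directly as a Hive table through saveAsTable; a minimal sketch, assuming a working Hive metastore (the table name people is illustrative):

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession\
        .builder\
        .appName("saveToHive")\
        .enableHiveSupport()\
        .getOrCreate()
    peopledf = spark.createDataFrame([('Jack', 27), ('Rose', 24), ('Andy', 32)], ['name', 'age'])
    # Creates the managed table on the first run, appends on later runs
    peopledf.write.mode('append').saveAsTable('people')
    spark.stop()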

hbase
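
Also a placeholder in the original. Spark has no built-in HBase sink; one route is the Apache hbase-connectors (hbase-spark) data source, which maps DataFrame columns onto an HBase row key and column family. A rough sketch under that assumption (the format string, options, table name and column mapping all belong to that connector, not core Spark):

# Requires the hbase-spark connector JAR on the classpath
peopledf.write\
    .format("org.apache.hadoop.hbase.spark")\
    .option("hbase.table", "people")\
    .option("hbase.columns.mapping", "name STRING :key, age INT info:age")\
    .save()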

kafka
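
Also a placeholder in the original. Spark's Kafka sink (the spark-sql-kafka package) expects a string or binary column named value, so rows are usually serialized to JSON first; a minimal sketch (the broker address and topic name are assumptions):

from pyspark.sql.functions import to_json, struct

# Pack all columns into a single JSON string column named "value"
kafka_df = peopledf.select(to_json(struct("name", "age")).alias("value"))
kafka_df.write\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "localhost:9092")\
    .option("topic", "people")\
    .save()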

elasticsearch
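
Also a placeholder in the original. Writing to Elasticsearch usually goes through the elasticsearch-hadoop connector; a rough sketch under that assumption (the node address and index name are illustrative, and the format string belongs to the connector, not core Spark):

# Requires the elasticsearch-spark JAR on the classpath
peopledf.write\
    .format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "localhost")\
    .option("es.port", "9200")\
    .mode("append")\
    .save("people")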


Reposted from blog.csdn.net/a544258023/article/details/94635807