PySpark Basic Introduction 5: RDD Persistence Methods


Note: If you find this blog helpful, don't forget to like and bookmark it. I update content related to artificial intelligence and big data every week, most of it original: Python, Java, Scala, and SQL code; CV, NLP, and recommendation systems; Spark, Flink, Kafka, HBase, Hive, Flume, and more, all practical material, plus interpretations of papers from top conferences. Let's make progress together.
Today I will continue to share with you PySpark Basic Introduction 5.
#博学谷IT technical support




Foreword

What I am sharing with you today are the persistence methods of Spark RDDs.


1. RDD cache

Cache:
Generally, when computing an RDD is very time-consuming or expensive (the computation rules are complex), or when an RDD needs to be reused in multiple places, the computed result of the RDD can be cached for subsequent use, which improves efficiency.
Caching can also improve the fault tolerance of an RDD: when a subsequent computation fails, the RDD does not have to be traced back along the entire dependency chain, which reduces recomputation time.

Note:
A cache is only temporary storage. Cached data can be saved to memory (the executor's memory space), to disk, or even to off-heap memory (system memory outside the executor).
Because it is temporary storage, the data may be lost. Therefore the cache operation does not truncate the dependencies (lineage) between RDDs; when the cache is invalidated, the data can be recomputed from the original dependencies.

The cache APIs are all lazy. To trigger the caching operation, a subsequent action operator is required; count is generally recommended.

If no action operator is added, the cache is only triggered when the first subsequent action operator is encountered.
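
For example, a minimal standalone sketch of this lazy behavior (the app name and toy data here are placeholders, not part of the demo below):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('cache_lazy_demo').setMaster('local[*]')
sc = SparkContext(conf=conf)

rdd_cached = sc.parallelize(range(100)).map(lambda x: x * 2).cache()  # lazy: only marks the RDD for caching
rdd_cached.count()  # the first action operator actually materializes and caches the data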

2. Usage steps

APIs for setting the cache:
rdd.cache(): performs the caching operation; it can only cache data in memory
rdd.persist(storageLevel): performs the caching operation; the storage level parameter defines where the data is cached

API for manually clearing the cache:
rdd.unpersist()

By default, when the entire Spark application finishes executing, the cache is automatically invalidated and deleted.

Commonly used cache levels:
MEMORY_ONLY: cache only in memory
DISK_ONLY: cache only on disk
MEMORY_AND_DISK: memory + disk; data is cached in memory first, and when memory is insufficient the remaining data is cached on disk
OFF_HEAP: cache in off-heap memory

Most commonly used: MEMORY_AND_DISK
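
A minimal sketch of setting an explicit storage level, triggering the cache, and clearing it manually (assuming sc is an existing SparkContext, as in the demo below; the data is a placeholder):

from pyspark import StorageLevel

rdd_heavy = sc.parallelize(['hadoop spark', 'spark flink', 'hadoop hive']) \
    .flatMap(lambda line: line.split())   # stand-in for an expensive computation

rdd_heavy.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)  # mark for caching with an explicit level (lazy)
rdd_heavy.count()                   # an action operator triggers the actual caching
print(rdd_heavy.getStorageLevel())  # confirm the effective storage level

rdd_heavy.unpersist()               # manually clear the cache once it is no longer needed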

1. Demonstration of cache usage


import time

import jieba
from pyspark import SparkContext, SparkConf, StorageLevel
import os

# Lock the remote environment to keep environments consistent
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'
"""
    清洗需求: 
	    需要先对数据进行清洗转换处理操作, 清洗掉为空的数据, 
	    以及数据字段个数不足6个的数据, 并且将每一行的数据放置到一个元组中, 
	    元组中每一个元素就是一个字段的数据
"""


def xuqiu1():
    # Requirement 1: count how many times each keyword appears and take the top 10
    res = rdd_map \
        .flatMap(lambda field_tuple: jieba.cut(field_tuple[2])) \
        .map(lambda keyWord: (keyWord, 1)) \
        .reduceByKey(lambda agg, curr: agg + curr) \
        .sortBy(lambda res_tup: res_tup[1], ascending=False).take(10)
    print(res)


def xuqiu2():
    res = rdd_map \
        .map(lambda field_tuple: ((field_tuple[1], field_tuple[2]), 1)) \
        .reduceByKey(lambda agg, curr: agg + curr) \
        .top(10, lambda res_tup: res_tup[1])
    print(res)


if __name__ == '__main__':
    print("Spark的Python模板")

    # 1. Create the SparkContext core object
    conf = SparkConf().setAppName('sougou').setMaster('local[*]')
    sc = SparkContext(conf=conf)

    # 2. Read the external file data
    rdd = sc.textFile(name='file:///export/data/workspace/ky06_pyspark/_02_SparkCore/data/SogouQ.sample')

    # 3. Perform the related operations:
    # 3.1 Perform the cleaning operation
    rdd_filter = rdd.filter(lambda line: line.strip() != '' and len(line.split()) == 6)

    rdd_map = rdd_filter.map(lambda line: (
        line.split()[0],
        line.split()[1],
        line.split()[2][1:-1],
        line.split()[3],
        line.split()[4],
        line.split()[5]
    ))

    # Since rdd_map is reused in several places, it can be cached here
    rdd_map.persist(storageLevel=StorageLevel.MEMORY_AND_DISK).count()

    # 3.2: Implement the requirements
    # Requirement 1: count how many times each keyword appears and take the top 10
    # Quick extract-method shortcut: Ctrl + Alt + M
    xuqiu1()
    
    # After requirement 1 finishes, invalidate the cache
    # (unpersist takes effect immediately; no action operator is needed)
    rdd_map.unpersist()

    # Requirement 2: count the number of clicks of each search term by each user
    xuqiu2()

    time.sleep(100)
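
The trailing time.sleep(100) keeps the application (and therefore the Spark Web UI, by default at http://localhost:4040 in local mode) alive for a while; while it is running, the cached rdd_map can be seen in the Storage tab of the UI.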

3. RDD checkpoint

A checkpoint is quite similar to caching, except that caching saves data to memory or disk, while a checkpoint saves data to disk or (mainly) HDFS.
A checkpoint provides a safer and more reliable persistence scheme, ensuring that the RDD's data is not lost. Once the checkpoint has been built, the dependencies (lineage) between RDDs are truncated; if a problem occurs in a later computation, the data can be recovered directly from the checkpoint.

Main purpose: fault tolerance. It can also improve efficiency (performance) to some extent, though not as much as caching.
	After a later computation fails, the data is recovered directly from the checkpoint and does not need to be recomputed.

Related API:
Step 1: set the location where checkpoint data is saved
	sc.setCheckpointDir('path address')

Step 2: enable the checkpoint on the corresponding RDD
	rdd.checkpoint()
	rdd.count()

Note:
	If running in cluster mode, the checkpoint save path must be on HDFS; in local mode a local path is also supported.
	Checkpoint data is not deleted automatically; it must be deleted manually.

import time

from pyspark import SparkContext, SparkConf
import os

# Lock the remote environment to keep environments consistent
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("演示checkpoint相关的操作")

    # 1- Create the SparkContext object
    conf = SparkConf().setAppName('sougou').setMaster('local[*]')
    sc = SparkContext(conf=conf)

    # Enable checkpointing: set the checkpoint path
    sc.setCheckpointDir('/spark/chk') # the default scheme for this path is HDFS
    # 2- Get the dataset
    rdd = sc.parallelize(['张三 李四 王五 赵六', '田七 周八 李九 老张 老王 老李'])

    # 3- Perform the related operations: the following steps only make the dependency chain longer and have no real meaning
    rdd1 = rdd.flatMap(lambda line: line.split())

    rdd2 = rdd1.map(lambda name: (name, 1))

    rdd3 = rdd2.map(lambda name_tuple: (f'{name_tuple[0]}_itcast', name_tuple[1]))

    rdd3 = rdd3.repartition(3)

    rdd4 = rdd3.map(lambda name_tuple: name_tuple[0])

    # Set a checkpoint on rdd4:
    rdd4.checkpoint()
    rdd4.count()


    rdd5 = rdd4.flatMap(lambda name: name.split('_'))
    rdd5 = rdd5.repartition(4)

    rdd6 = rdd5.map(lambda name: (name, 1))

    rdd_res = rdd6.reduceByKey(lambda agg, curr: agg + curr)

    print(rdd_res.collect())

    time.sleep(1000)
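
Because checkpoint data is not deleted automatically, the files written under /spark/chk remain after the application exits; in cluster mode they can be listed and removed manually, for example with hdfs dfs -ls /spark/chk and hdfs dfs -rm -r /spark/chk.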

4. The difference between cache and checkpoint

1- Storage location:
Cache: stored in memory, on disk, or in off-heap memory
Checkpoint: data can be stored on disk or HDFS; in cluster mode it can only be saved to HDFS

2- Lineage (dependency chain):
Cache: the lineage between RDDs is not cut off, because the cached data may become invalid; when it does, the computation has to be traced back again
Checkpoint: the lineage between RDDs is cut off, because the checkpoint saves the data to a safer, more reliable location; the data is considered safe from loss, so when execution fails there is no need to trace back the computation

3- Life cycle:
Cache: the cache is deleted when the program finishes executing, or when unpersist is called manually
Checkpoint: even after the program exits, the checkpoint data still exists and is not deleted; it must be deleted manually

It is generally recommended to use both persistence schemes together in a project: set the cache first, then set the checkpoint, and finally trigger execution once. (Under the hood, the data is cached first, and the cached data is then saved to the checkpoint path. When the data is used later, it is read from the cache first; if it is not in the cache, it is fetched from the checkpoint and placed back into the cache.)
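
A minimal sketch of this combined pattern (assuming an existing SparkContext sc and an already-cleaned RDD rdd_clean, both placeholders; the checkpoint path is the same as in the demo above):

from pyspark import StorageLevel

sc.setCheckpointDir('/spark/chk')   # checkpoint location (must be on HDFS in cluster mode)

rdd_clean.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)  # 1. set the cache first
rdd_clean.checkpoint()              # 2. then set the checkpoint
rdd_clean.count()                   # 3. trigger once: the data is cached, and the cached data
                                    #    is then written to the checkpoint path

# Later reads prefer the cache; if the cached data has been lost,
# it is recovered from the checkpoint instead of recomputing the whole lineage.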

Summary

Today I shared with you two persistence methods for RDDs.


Origin blog.csdn.net/weixin_53280379/article/details/129376210