[Big Data] RDD Programming

1. Define RDD:

     1.RDD

An RDD (Resilient Distributed Dataset) is a resilient, distributed dataset and the most basic unified data format unit in Spark applications.


     An RDD partition is a contiguous piece of data.

Different data sources must be unified into a single format, and that unified format is the RDD.

Spark computation is the process of transforming RDDs. An RDD is read-only: it cannot be modified, it can only be transformed into a new RDD.

Each Spark application includes a driver program that runs the user's main function and performs various parallel operations on the cluster. The main abstraction provided by Spark is the RDD, a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Spark is written in Scala and computes in memory. Users can also ask Spark to keep RDDs in memory, allowing them to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

Transforming from RDD1 to RDD2, and then to RDD3, does not compute the concrete data directly; the computation only records these representations. This transformation relationship is called a dependency.


    

2. Dependency:

Including wide dependency and narrow dependency
         Wide dependency : a partition of the parent RDD is used by multiple partitions of the child RDD (a shuffle is required)
         Narrow dependency : each partition of the parent RDD is used by at most one partition of the child RDD (a one-to-one relationship)


 

      When Spark runs, the engine builds a DAG (directed acyclic graph) of RDDs from their dependencies.
      For each such DAG, the Spark engine generates an execution plan, cuts the plan into stages,
      and creates the tasks for each stage (a code sketch follows the list).

  •  A stage boundary is cut wherever a wide dependency is encountered
  •  Each DAG corresponds to one job in Spark
  •  Each job is divided into multiple stages
  •  Each stage corresponds to a task set (TaskSet)
  •  Each task executes one instance per partition
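As a rough illustration of how a wide dependency cuts a stage, here is a minimal sketch (the app name "dag-demo" and the sample lines are made up for illustration, not from the original): flatMap and map are narrow dependencies and stay in one stage, while reduceByKey introduces a wide dependency (a shuffle) and starts a new stage. toDebugString() prints the lineage from which the DAG is built.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize(["hello word", "hello beijing"])
rdd2 = rdd1.flatMap(lambda line: line.split(" "))    # narrow dependency, same stage
rdd3 = rdd2.map(lambda word: (word, 1))              # narrow dependency, same stage
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)          # wide dependency, new stage

# toDebugString() shows the lineage; the indentation step marks the stage boundary
print(rdd4.toDebugString().decode("utf-8"))
print(rdd4.collect())    # the action submits the job, which is split into two stages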

Two, Spark development programming

  • Create a SparkContext object, call the Spark API through the SparkContext, use the API to create RDDs, transform them, trigger the computation, and save the results (a minimal end-to-end sketch follows).
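A minimal sketch of these four steps as a word count, assuming local mode (the output path is a placeholder, not from the original):

from pyspark.sql import SparkSession

# 1. Create the entry point and the Spark context
spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
sc = spark.sparkContext

# 2. Create an RDD (here from a local list)
lines = sc.parallelize(["hello word", "hello beijing", "hello taiyuan"])

# 3. Transform: split into words, pair each word with 1, aggregate by key
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# 4. Trigger the computation with an action and dump the result
print(counts.collect())
# counts.saveAsTextFile("D:/WorkSpace/tylg2020/resources/wordcount_output/")  # placeholder path

spark.stop()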

1. RDD creation method:

     Method 1 : Parallel creation through parallelize() (parallelize is generally used for development and testing)

from pyspark.sql import SparkSession

# Create the Spark entry point through SparkSession
# "wordcount" is the name given to the Spark application
# local[2] starts two worker threads (most machines have at least two cores); local[1] starts one, local[*] starts one per available core
spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()

# Create the Spark context
sc = spark.sparkContext

# parallelize: parallelizes a local collection or list into an RDD; generally used for testing while developing your own code

# Example 1:
ls = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rdd = sc.parallelize(ls)          # parallelize is a transformation operator; nothing is computed yet (lazy)
print(rdd.collect())              # collect is an action operator
rdd1 = rdd.map(lambda x: x * 2)   # map applies a function to every element; x*2 is the user-defined function here
print(rdd1.collect())


# Example 2: to make a collection or list easier to work with, convert it to an RDD
word_list = ["Hadoop", "Spark", "Hive", "Spark"]
rdd = sc.parallelize(word_list)
pairRDD = rdd.map(lambda word: (word, 1))   # ('Hadoop',1) ('Spark',1) ('Hive',1) ('Spark',1)
pairRDD.foreach(print)                      # foreach is an action operator

    Method 2 : Use textFile() to read external data to create an RDD

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
sc = spark.sparkContext


"""
Load the dataset from a local file and save it locally
"""
rdd = sc.textFile("D:/WorkSpace/tylg2020/resources/localfile/test.txt")

rdd.saveAsTextFile("D:/WorkSpace/tylg2020/resources/localsavefile/")

"""
Load the dataset from a local file and save it to HDFS
"""
rdd = sc.textFile("D:/WorkSpace/tylg2020/resources/localfile/test.txt")
print(type(rdd))
print(rdd.collect())
rdd.saveAsTextFile("hdfs://hadoop001:9000/localToHdfs")

"""
Load the dataset from HDFS and save it locally
"""
rdd = sc.textFile("hdfs://hadoop001:9000/test.txt")
print(type(rdd))
print(rdd.collect())
rdd.saveAsTextFile("D:/WorkSpace/tylg2020/resources/localsavefile")

"""
Load the dataset from HDFS and save it to HDFS
"""
rdd = sc.textFile("hdfs://hadoop001:9000/test.txt")
print(type(rdd))
print(rdd.collect())
rdd.saveAsTextFile("hdfs://hadoop001:9000/HdfsToHdfs")

 

  •      map(): applies the given function to every element
  •      foreach(): operates on each element of the result; it is an action operator and triggers computation
  •      collect(): action operator, triggers computation
  •      saveAsTextFile(): action operator, triggers computation

Method 3: Read the json file

import json

from pyspark import RDD
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("tylg").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    Method 3 of creating an RDD: read a JSON file

    """
    inputFile = "D:/WorkSpace/tylgPython/resources/demo.json"
    jsonStrs = sc.textFile(inputFile)                    # jsonStrs is an RDD
    result = jsonStrs.map(lambda s: json.loads(s))       # json.loads() converts a JSON string into a Python object
                                                         # loads works on strings, load works on file streams
    result.foreach(print)


2. RDD operator: two types

   Transformation operation --> describes the dependency relationship between RDDs. It is lazy: it only turns one RDD into a new RDD. It receives an RDD and outputs an RDD.
   Action operation --> triggers the computation and produces a result. It receives an RDD and outputs something that is not an RDD.

   Commonly used transformation operations (Transformation API); filter() and flatMap() are sketched right after this list

  • textFile(): read data
  • map(func): Pass each element to the function func, and return the result as a new data set
  • reduceByKey(func): when applied to a dataset of (K, V) key-value pairs, it returns a new (K, V) dataset in which the values of each key are aggregated with the function func, e.g. (hello, 1), (hello, 1), (hello, 1) -----> (hello, 3)
  • groupByKey(): when applied to a dataset of (K, V) key-value pairs, it returns a new (K, Iterable) dataset, e.g. hello, word, hello, beijing -----> (hello,1), (hello,1), (word,1), (beijing,1) -----> (hello,[1,1]), (word,[1]), (beijing,[1])
  • keys()
  • values()
  • sortByKey()
  • mapValues()
  • join()
  • filter(func): returns a new dataset containing the elements for which the function func returns true
  • flatMap(func): similar to map(), but each input element can be mapped to 0 or more output results
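filter() and flatMap() are not used in the later examples, so here is a small sketch of both (the sample lines are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["hello word", "hello beijing", "hello taiyuan"])

# flatMap: one input line can produce several output words
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())      # ['hello', 'word', 'hello', 'beijing', 'hello', 'taiyuan']

# filter: keep only the elements for which the function returns True
others = words.filter(lambda w: w != "hello")
print(others.collect())     # ['word', 'beijing', 'taiyuan']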

   Commonly used action operations (Action API); a short sketch follows the list

  • count()  returns the number of elements in the data set
  • collect()  returns all elements in the data set in the form of an array
  • first()  returns the first element in the data set
  • take(n)  returns the first n elements in the data set in the form of an array
  • reduce(func)  aggregates the elements in the data set through the function func (input two parameters and return a value)
  • foreach(func)  passes each element in the data set to the function func to run
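A short sketch that runs each of these actions on a small parallelized list (the numbers are just an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("action-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())                      # 5
print(rdd.collect())                    # [1, 2, 3, 4, 5]
print(rdd.first())                      # 1
print(rdd.take(3))                      # [1, 2, 3]
print(rdd.reduce(lambda a, b: a + b))   # 15
rdd.foreach(print)                      # runs on the executors; output order is not guaranteed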

Three, RDD application

data set:

1 hello word
2 hello beijing
3 hello taiyuan

 

 1. Create Pair RDD

    What is a Pair RDD?
      An RDD whose elements are key-value pairs is called a Pair RDD.
      Pair RDDs are usually used to aggregate data.
      A Pair RDD is converted from an ordinary RDD, for example
      by constructing a Pair RDD through map:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
 
    # Example 1:
    context = sc.textFile("D:/WorkSpace/tylg2020/resources/localfile/test.txt")

    # Put each line into a list as one element
    # -----> ['1 hello word', '2 hello beijing', '3 hello taiyuan']
    print(context.collect())

    # Use the line number as the key and the whole line as the value to build a pair RDD
    # -----> ('1', '1 hello word') ('2', '2 hello beijing') ('3', '3 hello taiyuan')
    pairRDD = context.map(lambda line: (line.split(" ")[0], line))

    print(pairRDD.collect())

   data set:

hello 
word
hello 
beijing
hello 
taiyuan

 

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    # Example 2:
    context = sc.textFile("D:/WorkSpace/tylg2020/resources/localfile/word.txt")
    
    # Each line of this file is a single word; use it as the key with 1 as the value to build a pair RDD
    # [('hello',1), ('word',1), ('hello',1), ('beijing',1), ('hello',1), ('taiyuan',1)]
    # stage 1: narrow dependency
    pairRDD = context.map(lambda word:(word,1))
    # collect is an action: it makes the previous step actually execute
    tmp_list = pairRDD.collect()
    
    print(tmp_list)
    
    # Iterate over tmp_list and print each element in turn
    def fun(x):
       print(x)
    [ fun(i) for i in tmp_list]

 

2. reduceByKey()

          It is an operator that merges the values of the same key with an associative function, i.e. it aggregates by key.

       For example, counting the occurrences of "a" in: ("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 1)

          reduceByKey((pre, after) => (pre + after))

          Step 1: 1, 1 -----> 1 + 1 = 2

          Step 2: 2, 1 -----> 2 + 1 = 3

          After step 2 the result is ("a", 3)

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    reduceByKey()
     is an operator that merges the values of the same key with an associative function
    """

    # Example 1
    nums= sc.parallelize(((1, 2), (3, 4), (3, 6)))
    sumCount = nums.reduceByKey(lambda x, y: x + y)
    #------->[(1,2),(3,10)]
    print(sumCount.collect())


    # Example 2

    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    data = sc.parallelize(ls)

    # Step 1: ('java',1), ('python',1), ...
    pairRDD = data.map(lambda x: (x, 1))
    # a and b are the values of the pairRDD elements
    wordCount = pairRDD.reduceByKey(lambda a, b: a + b)

    # Step 2: ('java',5), ('python',3), ...
    rsList=wordCount.collect()
    print(rsList)
    
    def fun(x):
      print(x)
    [fun(i) for i in rsList]

3. groupByKey() groups by key

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    groupByKey() groups the data by key, for example:
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    grouping by key gives a result of the form
    [("java", [1, 1, 1, 1, 1]), ("python", [1, 1, 1]), ("bigdata", [1, 1]), ("C", [1, 1])]
    """
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    data = sc.parallelize(ls)
    pairRDD = data.map(lambda x: (x, 1))

    groupRdd = pairRDD.groupByKey()
    tmp_list = groupRdd.collect()
    print(tmp_list)
    print(len(tmp_list))


    def fun(x):       # x is e.g. ("java", <iterable of [1, 1, 1, 1, 1]>)
        print(x[0])
        result = list(x[1])
        print(result)
        for i in result:
            print(i)
        print("*" * 60)


    [fun(i) for i in tmp_list]

4. keys() and values()

keys() retrieves all the keys in the RDD and returns them as a new RDD.

values() retrieves all the values in the RDD and returns them as a new RDD.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    keys() takes all the keys in the RDD (collect() then stores them in a list)
    values() takes all the values in the RDD (collect() then stores them in a list)
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    """
    # Example
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    # Build a pair RDD
    data = sc.parallelize(ls)
    pairRDD = data.map(lambda word: (word, 1))
    # Merge and count with reduceByKey
    newPairRDD = pairRDD.reduceByKey(lambda x, y: x + y)
    print(newPairRDD.collect())
    
    keysRDD = newPairRDD.keys()        # take the keys
    valuesRDD = newPairRDD.values()    # take the values
    
    keys_list =keysRDD.collect()
    values_list =valuesRDD.collect()
    print(keys_list)
    print(values_list)

5. mapValues()

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    mapValues() applies a function to every value

    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    """
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    data = sc.parallelize(ls)
    pairRDD = data.map(lambda word: (word, 1))
    reduceRDD=pairRDD.reduceByKey(lambda x,y:x+y)
    print(reduceRDD.collect())

    # Add 100 to every value
    mapValuesRDD = reduceRDD.mapValues(lambda a: 100 + a)
    rs_list = mapValuesRDD.collect()
    print(rs_list)

6.sortByKey()

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    sortByKey() sorts all the data by key
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    """
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]
    data = sc.parallelize(ls)
    pairRDD = data.map(lambda word: (word, 1))
    reduceRDD=pairRDD.reduceByKey(lambda x,y:x+y)
    print(reduceRDD.collect())
    
    # Sort by key, in alphabetical order
    sorteRDD =reduceRDD.sortByKey()
    rs_list=sorteRDD.collect()
    print(rs_list)

7.join()

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
    join() performs an inner join on the keys of two pair RDDs
    """
    ls1 = ["java", "python", "java", "java", "bigdata", "python", "java", "C"]
    ls2 = ["C", "python", "C", "java", "bigdata", "python", "bigdata", "C"]

    rdd1= sc.parallelize(ls1)
    rdd2= sc.parallelize(ls2)
    pairRDD1 = rdd1.map(lambda word:(word,1))
    pairRDD2 = rdd2.map(lambda word:(word,1))
    print(pairRDD1.collect())
    print(pairRDD2.collect())

    # Join the two pair RDDs: each pair from ls1 is combined with every pair from ls2 that has the same key, e.g. ('java', (1, 1))
    newPairRDD =pairRDD1.join(pairRDD2)

    rs_list =newPairRDD.collect()
    print(rs_list)
    print(len(rs_list))

8. Persistence

Reasons for using persistence:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
   
    ls = ["java", "python", "java", "java", "bigdata", "python", "java", "C", "C", "python", "bigdata", "java"]

    data = sc.parallelize(ls)
    pairRDD = data.map(lambda word: (word, 1))
    newPairRDD = pairRDD.reduceByKey(lambda x, y: x + y)
    print(newPairRDD.collect())
    
    keysRDD = newPairRDD.keys()     # Both uses of newPairRDD here refer to the same RDD, but each use recomputes
                                    # the whole chain from sc onwards, i.e. the lineage is executed twice from the
                                    # beginning, wasting time; that is why newPairRDD should be persisted
    valuesRDD = newPairRDD.values()
    
    keys_list =keysRDD.collect()
    values_list =valuesRDD.collect()
    print(keys_list)
    print(values_list)

 Persistence example:

As above, Spark would otherwise run the whole computation twice, from beginning to end. This duplicated computation overhead can be avoided through the persistence (caching) mechanism. You can use the persist() method to mark an RDD as persistent. It is only a mark because, at the point where the persist() call appears, the RDD is not generated and persisted immediately; only when the first real action triggers the computation is the result persisted. After that, the persisted RDD is kept in the memory of the compute nodes and reused by later actions.

from pyspark.sql import SparkSession
if __name__ == "__main__":
    """
    The reason for persistence is to reduce repeated reads and recomputation of the data.
    To avoid recomputing newPairRDD on every use as before, the RDD is stored, i.e. persisted.
    """
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    textFile=sc.textFile("D:/WorkSpace/tylg2020/resources/localfile/persit.txt")
    # Persist textFile so that later actions reuse the cached data instead of re-reading the file
    textFile.persist()
    filterRDD = textFile.filter(lambda line: "Spark" in line)

    counter =filterRDD.count()
    first_line =textFile.first()

          persist(): persistence level parameters (a usage sketch follows the list)

  1. persist(MEMORY_ONLY): stores the RDD as deserialized objects in the JVM. If memory is insufficient, some partitions may not be persisted; the next time an operator runs on this RDD, the unpersisted data has to be recomputed from the source. This is the default persistence strategy; the cache() method actually uses persist(MEMORY_ONLY).
  2. persist(MEMORY_AND_DISK): stores the RDD as deserialized objects in the JVM. If memory is insufficient, the partitions that do not fit are stored on disk and are read back from the disk files the next time an operator runs on this RDD.
  3. persist(MEMORY_ONLY_SER): the data in the RDD is serialized, each partition into a byte array. This saves memory and can prevent the persisted data from taking up too much memory and causing frequent GC.
  4. persist(MEMORY_AND_DISK_SER): the only difference from MEMORY_AND_DISK is that the data in the RDD is serialized, each partition into a byte array, which likewise avoids the persisted data taking up too much memory and causing frequent GC; partitions that do not fit in memory spill to disk.
  5. persist(DISK_ONLY): writes all the data to disk files. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: for any of the above persistence strategies, adding the suffix _2 means every piece of persisted data is replicated and the replica is saved on another node. This replication-based persistence mechanism is mainly used for fault tolerance: if a node fails and the persisted data in its memory or on its disk is lost, the replica on another node can still be used in subsequent computations on the RDD; without a replica, the data can only be recomputed from the source.
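In PySpark the level is passed as a pyspark.StorageLevel object; a minimal sketch of marking, reusing, and releasing a persisted RDD (the word list is just an illustration):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

wordCount = sc.parallelize(["java", "python", "java", "bigdata"]) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Mark the RDD as persistent; nothing is cached until the first action runs
wordCount.persist(StorageLevel.MEMORY_AND_DISK)
# wordCount.cache() would be shorthand for persisting with the default memory-only level

print(wordCount.keys().collect())     # first action: computes the lineage and caches the result
print(wordCount.values().collect())   # second action: reuses the cached partitions

wordCount.unpersist()                 # release the cached data when it is no longer needed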

9. Partition (optimization method)

An RDD is a resilient distributed dataset. An RDD is usually very large and is divided into many partitions, stored on different nodes; one partition corresponds to one task. A common partitioning principle is to make the number of partitions equal to (or two to three times) the total number of CPU cores in the cluster. Generally speaking:

7 machines * 3 cores = 21 cores = 21 partitions

  • Local mode: The default is the number of CPUs of the local machine, if local[N] is set, the default is N
  • Apache Mesos: The default number of partitions is 8
  • Standalone or YARN: Choose the larger value of the " sum of all CPU cores in the cluster " and " 2 " as the default value
tmp_list = [1, 2, 3, 4, 5]

rdd = sc.parallelize(tmp_list, 2)  # set two partitions
  •   For the different Spark deployment modes (local, Standalone, YARN, Mesos), you can configure the default number of partitions by setting the parameter spark.default.parallelism (in conf/spark-defaults.conf)
  • Different data sources have different default numbers of partitions, so to tune Spark you should check the partition count on each platform
  • spark.default.parallelism only affects RDD processing, not Spark SQL
  •  For parallelize(), if the number of partitions is not specified in the method call, the default is spark.default.parallelism; for textFile(), the default minimum number of partitions is min(defaultParallelism, 2)
  •   If the file is read from HDFS, the number of partitions is the number of file splits (for example, 128 MB per split in hadoop2.x)
  • Too many partitions is not good either: it means many tasks are waiting at any given moment. A sketch of inspecting and changing partitions follows.
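A small sketch of inspecting and changing the number of partitions (the values shown assume local[2] and this particular list):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)   # explicitly ask for two partitions

print(rdd.getNumPartitions())   # 2
print(rdd.glom().collect())     # e.g. [[1, 2, 3], [4, 5, 6]] -- one inner list per partition

rdd4 = rdd.repartition(4)       # repartition() reshuffles the data into 4 partitions
print(rdd4.getNumPartitions())  # 4

rdd1 = rdd4.coalesce(1)         # coalesce() reduces the partition count without a full shuffle
print(rdd1.getNumPartitions())  # 1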

10. Shared variables: (optimization method)

       By default, if the same function runs in multiple tasks on multiple different nodes of the Spark cluster, a copy of each variable used by the function is generated for each task. But sometimes you need to share variables between multiple tasks, or between tasks and the driver (the task control node). To meet these two needs, Spark provides two types of shared variables:

    Broadcast variables : share a variable in the memory of all nodes

               Usage scenarios : tasks in cross-machine, cross-stage parallel computation need the same data, or data needs to be cached in deserialized form

               After the broadcast variable is defined, the original variable should no longer be used. The broadcast variable
               can be used between multiple tasks and between different servers, and its value is read through .value.
               This avoids repeatedly shipping the same data to every server.
               In the case below, ls should no longer be used after broadcasting; only broadcastRDD should be used, which avoids repeatedly distributing ls across the cluster.

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    """
    Broadcast variable: a variable that can be shared by multiple tasks on each compute node
    """
    ls =[1,2,3,4,5,6,7,8,9]
    """
    Create a broadcast variable with the broadcast() method
    """
    broadcastRDD =sc.broadcast(ls)
    
    # Read the value directly from the broadcast variable
    print(broadcastRDD.value)


     Accumulator : supports accumulation across all the different nodes

              Use the SparkContext object sc to call the accumulator() method to create an accumulator.
              Use the accumulator object's add() method to accumulate data.
              The accumulator can be used between different tasks and between different servers.
              Read the result with .value.

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").master("local[2]").getOrCreate()
    sc = spark.sparkContext
    """
     Accumulator: created with SparkContext.accumulator(); use the add() method to add values to it,
     and the value attribute to read the accumulated result
    """
    ls = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    lsRDD=sc.parallelize(ls)
    
    # Create the accumulator object
    addRDD=sc.accumulator(0)
    
    # addRDD calls add() to accumulate each element
    lsRDD.foreach(lambda x:addRDD.add(x))
    
    print(addRDD.value)

 


Origin blog.csdn.net/Qmilumilu/article/details/104688077