Spark from 0 to 1 (1): Introduction to Apache Spark

1. Getting to Know Spark

1.1 What is Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general computing framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab (University of California, Berkeley). Spark has the advantages of Hadoop MapReduce; the key difference from MapReduce is that the intermediate results of a job can be kept in memory, so there is no need to read and write HDFS between steps. Spark is therefore better suited to iterative MapReduce-style algorithms such as data mining and machine learning.

1.2 Differences between Spark and MapReduce

  1. Both are distributed computing frameworks.
  2. Spark computes on data held in memory, while MapReduce reads and writes intermediate results to HDFS.
  3. Spark generally processes data more than ten times faster than MapReduce.
  4. Spark schedules the execution order of tasks based on a DAG (directed acyclic graph).

1.3 Spark run modes

  • Local

    Mostly used for local testing, for example writing and testing programs in Eclipse or IDEA.

  • Standalone

    Standalone is the resource scheduling framework that ships with Spark; it supports fully distributed deployment.

  • Yarn

    Yarn is the resource scheduling framework in the Hadoop ecosystem; Spark can also run its computation on Yarn.

  • Mesos

    A resource scheduling framework; Spark can also run on Mesos.

To schedule resources on Yarn, a framework must implement the ApplicationMaster interface; Spark implements this interface, so it can run on Yarn.

2. SparkCore

2.1 RDD

2.1.1 Concept

RDD (Resilient Distributed Dataset): a resilient, distributed data set.

2.1.2 Five characteristics of RDD

  • An RDD is composed of a series of partitions.
  • A compute function is applied to each partition.
  • There is a series of dependencies between RDDs.
  • A partitioner works on RDDs of (K, V) format.
  • An RDD provides a list of preferred locations for computing each partition.

2.1.3 RDD diagram

(figure: RDD diagram)

Reading the diagram from top to bottom:

  1. Under the hood, the textFile method wraps the way MapReduce reads files: the file is split before it is read, and the default split size equals one block size.

  2. An RDD does not actually store data; for ease of understanding, it can temporarily be thought of as storing data.

  3. What is an RDD in (K, V) format?

    If every element stored in the RDD is a two-element tuple (a key-value pair), the RDD is called a (K, V)-format RDD.

  4. Where are the resilience and fault tolerance of an RDD reflected?

    The number and size of an RDD's partitions are not restricted, which reflects the RDD's elasticity.

    Because of the dependencies between RDDs, a lost RDD can be recomputed from the RDD it depends on.

  5. Where is the distributed nature of an RDD reflected?

    An RDD is composed of partitions, and the partitions are distributed across different nodes.

  6. An RDD provides the preferred locations for computation, which enables data locality and follows the big-data principle of "move the computation, not the data".

2.2 Spark task execution principle

(figure: Spark task execution diagram)

The figure shows four machine nodes; Driver and Worker are processes started on these nodes, each running in a JVM.

  1. The Driver communicates frequently with the cluster nodes.
  2. The Driver is responsible for task scheduling, distributing tasks, and collecting their results. If a task's result is very large, do not collect it back to the Driver, because doing so can cause an OOM error (see the sketch after this list).
  3. Worker is the slave node for resource management in the Standalone resource scheduling framework; it is also a JVM process.
  4. Master is the master node for resource management in the Standalone resource scheduling framework; it is also a JVM process.
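
A minimal sketch of point 2 above (the input path is a placeholder and the snippet is illustrative, not from the original text; the usual Spark Java imports such as SparkConf, JavaSparkContext, JavaRDD, and java.util.List are assumed):

    // assumed setup: a local run and a placeholder input path
    JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("DriverVsExecutor").setMaster("local"));
    JavaRDD<String> bigRdd = sc.textFile("/path/to/large/input");

    // foreach runs on the executors; no data is shipped back to the Driver
    bigRdd.foreach(line -> System.out.println(line));

    // collect() pulls every element into the Driver's memory; for a very large
    // RDD this is exactly the OOM risk described above, so prefer take(n) or
    // saving the result to storage instead
    List<String> allLines = bigRdd.collect();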

2.3 Spark code flow

  1. Create a SparkConf object.

    The appName can be set here.

    The run mode and the resource requirements can also be set here.

  2. Create a SparkContext object.

  3. Create RDDs from the Spark context and process the RDDs.

  4. The application must contain an action operator to trigger the execution of the transformation operators.

  5. Close the Spark context object SparkContext.
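
Putting the five steps together, a minimal word-count sketch in Java might look like the following (the input path data/words.txt, the class name, and the word-count logic are illustrative assumptions, not part of the original text):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class WordCount {
        public static void main(String[] args) {
            // 1. SparkConf: appName and run mode
            SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local");
            // 2. SparkContext
            JavaSparkContext sc = new JavaSparkContext(conf);
            // 3. Create RDDs and process them with transformation operators (lazy)
            JavaRDD<String> lines = sc.textFile("data/words.txt");   // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);
            // 4. An action operator triggers execution of the whole job
            counts.foreach(t -> System.out.println(t._1() + " : " + t._2()));
            // 5. Close the SparkContext
            sc.stop();
        }
    }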

2.4 Transformation operators

2.4.1 Concept

A transformation operator is one class of operator (function), for example map, flatMap, and reduceByKey. Transformation operators are executed lazily (also called lazy evaluation): nothing runs until an action operator triggers the job, as the sketch below illustrates.
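
A small sketch of this lazy behavior (hypothetical, reusing a JavaSparkContext named sc like the one created in the word-count sketch in section 2.3): the side effect inside map does not run until an action is called.

    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4));

    // map is a transformation: this line only records the computation,
    // so nothing is printed yet
    JavaRDD<Integer> doubled = nums.map(x -> {
        System.out.println("mapping " + x);
        return x * 2;
    });

    // count is an action: only now does Spark run the job,
    // and the "mapping ..." lines appear
    long n = doubled.count();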

2.4.2 Common transformation operators

Common transformation operators and their effects:
map(func): Returns a new distributed dataset in which each element is produced by passing an element of the source RDD through the func function.
filter(func): Returns a new dataset containing the elements of the source RDD for which the func function returns true.
flatMap(func): Similar to map, but each input element can be mapped to 0 to n output elements (the func function must return a sequence (Seq) rather than a single element).
mapPartitions(func): Similar to map, but runs independently on each partition (data block) of the RDD, so if the RDD contains elements of type T, func must be a mapping function of Iterator<T> => Iterator<U>.
mapPartitionsWithIndex(func): Similar to mapPartitions, except that func also receives an integer partition index, so if the RDD contains elements of type T, func must be a mapping function of (Int, Iterator<T>) => Iterator<U>.
mapWith(func1,func2): Another variant of map. map takes a single input function, while mapWith takes two: the first takes the partition index of the RDD (starting from 0) as input and outputs a value of a new type A; the second takes the pair (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and outputs a value of type U.
reduceByKey(func,[numTasks]): If the source RDD contains elements of (K, V) pair type, this operator also returns an RDD of (K, V) pairs, where the value for each key is the result of aggregation with the func function, which must be a mapping function of (V, V) => V. It is otherwise similar to groupByKey. The number of reduce tasks can be specified with the optional parameter numTasks.
aggregateByKey(zeroValue,seqOp,combOp,[numTasks]): If the source RDD contains (K, V) pairs, the returned RDD also contains (K, V) pairs, where the value for each key is aggregated with the seqOp and combOp functions starting from the "zero" value zeroValue. The aggregated value type may differ from the input value type, which avoids unnecessary overhead. It is otherwise similar to groupByKey. The number of reduce tasks can be specified with the optional parameter numTasks.
sortByKey([ascending],[numTasks]): If the source RDD contains (K, V) pairs and K can be sorted, returns a new RDD of (K, V) pairs sorted by K (the ascending parameter determines ascending or descending order).
sortBy(func,[ascending],[numTasks]): Similar to sortByKey, except that sortByKey can only sort by key, while sortBy is more flexible and can sort by key or by value.
randomSplit(Array[Double],Long): Splits one RDD into multiple RDDs according to the given weights. The weight parameter is a Double array; the second parameter is the random seed and can usually be ignored.
glom(): Converts the elements of type T in each partition of the RDD into an Array[T], so that each partition holds a single array element.
zip(otherDataset): Combines two RDDs into an RDD of (K, V) pairs. The two RDDs must have the same number of partitions and the same number of elements, otherwise an exception is thrown.
partitionBy(partitioner): Generates a new ShuffledRDD according to the given partitioner, repartitioning the original RDD.
join(otherDataset,[numTasks]): Equivalent to an SQL INNER JOIN: a key is returned only when it exists in both datasets. The number of partitions after the join is the same as that of the parent RDD with more partitions.
leftOuterJoin(otherDataset): Equivalent to an SQL LEFT JOIN: returns all the data of the left dataset plus the intersection of the two sides; values missing on the right are filled with None.
rightOuterJoin(otherDataset): Equivalent to an SQL RIGHT JOIN: returns all the data of the right dataset plus the intersection of the two sides; values missing on the left are filled with None.
fullOuterJoin(otherDataset): Returns all the data of both datasets; values that do not exist on one side are filled with None.
union: Merges two datasets, whose element types must be the same. The number of partitions of the returned RDD is the sum of the partition counts of the merged RDDs.
intersection: Takes the intersection of two datasets; the returned RDD has the same number of partitions as the parent RDD with more partitions.
subtract: Takes the difference of two datasets; the number of partitions of the result RDD is the same as that of the RDD on which subtract is called.
distinct: Removes duplicate elements, equivalent to map + reduceByKey + map.
cogroup: Called on datasets of types (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)); the number of partitions of the child RDD is the same as the parent RDD with more partitions.
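
As a brief illustration of a few of these operators (a hypothetical sketch, again reusing the JavaSparkContext sc and imports from the word-count sketch in section 2.3):

    List<Tuple2<String, Integer>> scoreList = Arrays.asList(
            new Tuple2<>("alice", 90), new Tuple2<>("bob", 80));
    List<Tuple2<String, String>> cityList = Arrays.asList(
            new Tuple2<>("alice", "Beijing"), new Tuple2<>("carol", "Shanghai"));
    JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoreList);
    JavaPairRDD<String, String> cities = sc.parallelizePairs(cityList);

    // join keeps only keys that exist on both sides (like an SQL INNER JOIN):
    // here only ("alice", (90, "Beijing")) survives
    JavaPairRDD<String, Tuple2<Integer, String>> joined = scores.join(cities);

    // union concatenates two RDDs of the same element type; the result has
    // the sum of the two partition counts (2 + 3 = 5 here)
    JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 2), 2);
    JavaRDD<Integer> b = sc.parallelize(Arrays.asList(3, 4), 3);
    JavaRDD<Integer> unioned = a.union(b);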

2.5 Action operators

2.5.1 Concept

An action operator is also a kind of operator (function), for example foreach, collect, and count. Transformation operators are executed lazily; action operators trigger execution. An application runs one job for each action operator it executes.

2.5.2 Common action operators

Common action operators and their effects:
reduce(func): Aggregates the elements of the RDD with the func function, which is a mapping function of (T, T) => T, where T is the element type of the source RDD; func must be commutative and associative so that it can be computed in parallel.
collect(): Gathers the dataset and returns all of its elements to the driver program as an array. Usually used after a filter or other operation has reduced the data to a subset small enough to fit in the driver's memory; otherwise the driver will run out of memory (OOM).
count(): Returns the number of elements in the dataset.
first(): Returns the first element of the dataset (similar to take(1)).
take(n): Returns the first n elements of the dataset.
takeSample(withReplacement,num,[seed]): Returns a random sample of the dataset containing at most num elements; withReplacement indicates whether to sample with replacement, and the optional last parameter seed is the seed of the random number generator.
takeOrdered(n,[ordering]): Returns the first n elements after sorting (a custom ordering can be supplied through the ordering parameter).
foreach(func): Iterates over every element of the dataset and runs the given logic on it.
foreachPartition(func): Similar to foreach, except that the function is applied once to each partition; it performs better than foreach and is recommended.
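
A quick sketch of some action operators (hypothetical, reusing sc from the sketch in section 2.3); each of the action calls below triggers its own job:

    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(5, 3, 1, 4, 2));

    long total = nums.count();                      // 5
    Integer firstElement = nums.first();            // 5
    List<Integer> firstTwo = nums.take(2);          // [5, 3]
    Integer sum = nums.reduce((x, y) -> x + y);     // 15

    // foreachPartition applies the function once per partition rather than
    // once per element, which is cheaper when there is per-partition setup work
    nums.foreachPartition(it -> it.forEachRemaining(x -> System.out.println(x)));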

2.6 Control operators

2.6.1 Concept

There are three control operators: cache, persist, and checkpoint. All of them persist an RDD, and the unit of persistence is the partition. cache and persist are executed lazily, so an action operator must trigger them. The checkpoint operator not only persists the RDD to disk but also cuts off the dependencies between RDDs.

2.6.2 Introduction to the control operators

  1. cache

    By default, cache persists the RDD's data in memory. cache is executed lazily.

    cache() = persist(StorageLevel.MEMORY_ONLY());
    
  2. persist

    The persistence level can be specified. The most commonly used levels are StorageLevel.MEMORY_ONLY() and StorageLevel.MEMORY_AND_DISK(). The "_2" suffix means the data is stored with a replica.

    The persistence levels are defined as follows:

    def useDisk : scala.Boolean = { /* compiled code */ }
    def useMemory : scala.Boolean = { /* compiled code */ }
    def useOffHeap : scala.Boolean = { /* compiled code */ }
    def deserialized : scala.Boolean = { /* compiled code */ }
    def replication : scala.Int = { /* compiled code */ }
    
    The persistence levels and their effects:
    NONE: No persistence.
    DISK_ONLY: Persist to disk only.
    DISK_ONLY_2: Persist to disk only, with 2 replicas.
    MEMORY_ONLY: Persist to memory only.
    MEMORY_ONLY_2: Persist to memory only, with 2 replicas.
    MEMORY_ONLY_SER: Persist to memory only, serialized.
    MEMORY_ONLY_SER_2: Persist to memory only, serialized, with 2 replicas.
    MEMORY_AND_DISK: Persist to memory and disk; spill to disk when memory is insufficient.
    MEMORY_AND_DISK_2: Persist to memory and disk, spilling to disk when memory is insufficient, with 2 replicas.
    MEMORY_AND_DISK_SER: Persist to memory and disk, spilling to disk when memory is insufficient, serialized.
    MEMORY_AND_DISK_SER_2: Persist to memory and disk, spilling to disk when memory is insufficient, serialized, with 2 replicas.
    OFF_HEAP: Persist to off-heap memory.

    Cautions for cache and persist:

    1. Both cache and persist are executed lazily; an action operator must trigger them.
    2. The return value of cache or persist can be assigned to a variable; using that variable in other jobs means using the persisted data. The unit of persistence is the partition.
    3. cache and persist cannot be immediately followed by an action operator (see the sketch below).
    4. The data persisted by cache and persist is cleared after the application finishes.

    Wrong: rdd.cache().count() returns a numeric value (the count of elements), not a persisted RDD.
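
    A minimal sketch of these cautions (hypothetical, reusing sc from the sketch in section 2.3; StorageLevel is org.apache.spark.storage.StorageLevel and the input path is a placeholder):

    JavaRDD<String> logs = sc.textFile("/path/to/logs");

    // right: assign the persisted RDD to a variable, then call actions on that variable
    JavaRDD<String> cached = logs.persist(StorageLevel.MEMORY_AND_DISK());
    long firstCount = cached.count();    // computes the RDD and persists its partitions
    long secondCount = cached.count();   // served from the persisted partitions

    // wrong: chaining an action right after cache returns the action's result
    // (a number), not a persisted RDD
    // long n = logs.cache().count();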

  3. checkpoint

    Checkpoint persists an RDD to disk and also cuts off the dependencies between RDDs. The data in the checkpoint directory is not cleared after the application finishes.

    The execution principle of checkpoint:

    1. When the job of the RDD finishes executing, Spark backtracks from the finalRDD toward the front.
    2. When an RDD on which the checkpoint method was called is encountered, that RDD is marked.
    3. The Spark framework then automatically starts a new job that recomputes the data of the marked RDD and persists it to HDFS.

    Optimization: before checkpointing an RDD, it is best to cache that RDD first, so that the newly started job only needs to copy the data from memory to HDFS instead of recomputing it.

    Usage:

    SparkSession spark = SparkSession.builder()
            .appName("JavaLogQuery").master("local").getOrCreate();
    JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
    // exampleApacheLogs is a List<String> of sample log lines
    JavaRDD<String> dataSet = sc.parallelize(exampleApacheLogs);
    // cache first, so the checkpoint job can copy data from memory instead of recomputing it
    dataSet = dataSet.cache();
    // the checkpoint directory must be set before checkpoint() is called
    sc.setCheckpointDir("/checkpoint/dir");
    dataSet.checkpoint();
    
  4. unpersist

    Removes the data persisted in memory and on disk.
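
    For example, to free the data cached in the persist sketch above (hypothetical):

    cached.unpersist();   // removes the persisted partitions from memory and disk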

Origin blog.csdn.net/dwjf321/article/details/109047486