Components involved in Spark's distributed execution
Every Spark application launches parallel operations on a cluster from a driver program. The driver accesses Spark through a SparkContext object, manages a number of executor nodes, and can create RDDs via the SparkContext.
RDD (Resilient Distributed Dataset)
RDD Features
- In Spark, all work with data comes down to three things: creating RDDs, transforming existing RDDs, and calling actions on RDDs to compute results.
- Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations performed on it.
- In Spark, an RDD is an immutable distributed collection of objects.
Two ways to create an RDD
- Load an external dataset, for example:
sc.textFile("readme.md")
- Parallelize a collection of objects (such as a list or set) held in the driver program, by passing the existing collection to SparkContext's parallelize() method. This approach is not used much in practice, because it requires the entire dataset to fit in memory on a single machine.
RDDs support two types of operations - I: transformations
- A transformation is an operation that returns a new RDD.
- Many transformations work on the RDD element by element, though not all transformations do.
- Common transformations
filter()
: takes a function and returns a new RDD made up of the elements that satisfy that function.
map()
: takes a function, applies it to each element of the RDD, and builds a new RDD from the function's return values. - There are also some pseudo set operations: the set property RDDs most often lack is uniqueness of elements. You can use
RDD.distinct()
to generate a new RDD containing only the distinct elements. distinct() is expensive, though, because all the data must be shuffled over the network.
RDDs support two types of operations - II: actions
- Actions return a result to the driver program or write data to an external storage system, and they trigger the actual computation. By default, Spark recomputes an RDD every time you run an action on it. If you want to reuse the same RDD across multiple actions, you can use
RDD.persist()
to cache (persist) the RDD. - There is an RDD method
collect()
that retrieves the data of the entire RDD into the driver, but this requires the RDD's data to be small. - Common action functions
reduce()
: takes a function as a parameter; the function operates on two elements of the RDD's element type and returns a new element of the same type. A simple example is the "+" function. - Others include, for example,
count()
RDDs support two types of operations - how they relate
- The key difference between transformations and actions lies in how Spark computes RDDs: Spark evaluates RDDs lazily, meaning that transformations are only actually computed the first time an action uses their results.
- Lazy evaluation: "Rather than thinking of an RDD as a container holding a particular dataset, it is better to think of each RDD as a list of instructions, built up through transformations, that records how to compute the data."