[Spark Fast Big Data Analysis] Spark Basics

Components involved in Spark's distributed execution

Every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program accesses Spark through a SparkContext object and manages a number of executor nodes; RDDs are created through the SparkContext.
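A minimal driver-program sketch in Scala, assuming a local master and an illustrative application name (neither is specified in the original text); it only shows the SparkContext being created by the driver and used to build an RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The driver program describes the application via a SparkConf and
    // talks to the cluster through a SparkContext.
    val conf = new SparkConf().setAppName("SparkBasics").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // RDDs are created through the SparkContext (see the sections below).
    val lines = sc.textFile("readme.md")
    println(lines.count())

    sc.stop()
  }
}
```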

RDD (Resilient Distributed Dataset)

RDD Features

  • In Spark, all work with data boils down to three things: creating RDDs, transforming existing RDDs, and calling actions on RDDs to compute a result.
  • Spark automatically distributes the data contained in an RDD across the cluster and parallelizes the operations performed on it.
  • In Spark, an RDD is an immutable, distributed collection of objects.

Two ways to create an RDD

  • Load an external dataset, e.g. sc.textFile("readme.md").
  • Parallelize a collection of objects (such as a list or set) that already lives in the driver program, by passing it to SparkContext's parallelize() method. This approach is rarely used outside of prototyping, because it requires the entire dataset to fit in memory on a single machine. Both methods are sketched below.
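A short sketch of both creation methods, assuming a SparkContext named `sc` is already available (as in the driver sketch above); the sample strings are illustrative.

```scala
// 1. Load an external dataset.
val lines = sc.textFile("readme.md")

// 2. Parallelize a collection from the driver program.
//    Convenient for prototyping and tests, but the whole collection
//    must fit in the driver's memory.
val words = sc.parallelize(List("pandas", "i like pandas"))
```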

RDDs support two types of operations (1): transformations

  • A transformation is an operation that returns a new RDD.
  • Many transformations work element by element, operating on one element at a time, although not all transformations behave this way.
  • Common transformation filter(): takes a function and returns a new RDD made up of the elements that satisfy that function.
  • Common transformation map(): takes a function, applies it to every element of the RDD, and returns a new RDD made up of the results.
  • There are also some pseudo set operations: the set property RDDs most often lack is uniqueness of elements. RDD.distinct() can be used to produce a new RDD containing only distinct elements, but distinct() is expensive because all of the data has to be shuffled over the network.
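A sketch of these transformations, assuming an existing SparkContext `sc`; the input numbers are illustrative.

```scala
val nums = sc.parallelize(List(1, 2, 2, 3, 4, 4))

// map(): apply a function to every element and collect the results in a new RDD.
val squares = nums.map(x => x * x)          // 1, 4, 4, 9, 16, 16

// filter(): keep only the elements that satisfy the predicate.
val evens = nums.filter(x => x % 2 == 0)    // 2, 2, 4, 4

// distinct(): pseudo set operation; removes duplicates but shuffles all the data.
val unique = nums.distinct()                // 1, 2, 3, 4 (in some order)

// None of these lines triggers computation yet, and `nums` itself is never modified.
```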

RDDs support two types of operations (2): actions

  • Actions return a result to the driver program or write it to an external storage system, and they trigger the actual computation. By default, Spark recomputes an RDD each time an action is run on it. If you want to reuse the same RDD across multiple actions, use RDD.persist() to cache (persist) it.
  • collect() retrieves the entire contents of an RDD, but it requires that the RDD's data be small enough to fit on a single machine.
  • Common action reduce(): takes a function that combines two elements of the RDD's element type and returns a new element of the same type. A simple example is the "+" function.
  • Other actions include count().
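A sketch of these actions, again assuming an existing SparkContext `sc`; the numbers are illustrative.

```scala
val nums = sc.parallelize(List(1, 2, 3, 4))

// Persist before running several actions, so the RDD is not recomputed each time.
nums.persist()

val sum   = nums.reduce((a, b) => a + b)   // 10: "+" combines two elements of the same type
val total = nums.count()                   // 4: number of elements
val all   = nums.collect()                 // Array(1, 2, 3, 4): only safe for small RDDs
```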

RDDs support two types of operations: how they relate

  • The distinction between transformations and actions comes from how Spark computes RDDs: Spark evaluates RDDs lazily, which means transformations are only actually computed the first time they are used in an action.
  • Lazy evaluation: "It is best not to think of an RDD as a container holding a specific dataset, but rather as a list of instructions, built up through transformations, that records how to compute the data."
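A small sketch of lazy evaluation, assuming an existing SparkContext `sc` and a local "readme.md" file.

```scala
// These two lines only record *how* to compute the result; nothing is read or filtered yet.
val lines      = sc.textFile("readme.md")
val sparkLines = lines.filter(line => line.contains("Spark"))

// Only this action forces Spark to actually scan the file and apply the filter.
println(sparkLines.count())
```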

Source: www.cnblogs.com/coding-gaga/p/11443982.html