Spark Core RDD (Part 1)

One, prerequisite knowledge:

Key concepts: RDD, partitions, pipelining, Stage ...

0. For iterative computations, RDDs are up to 20x faster than Hadoop, and performance for computing analytical reports improves by about 40x; a 1 TB dataset can be queried interactively within 5-7 seconds. RDDs suit batch applications that apply the same operation to every element of a dataset.

1. RDD persistence: first, persist() with no arguments internally calls persist(StorageLevel.MEMORY_ONLY); second, cache() simply calls persist(). Note: calling a persistence method does not cache anything immediately; only when a later action is triggered is the RDD cached in the memory of the compute nodes for subsequent reuse.
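
A minimal sketch of this lazy caching behaviour, assuming an existing SparkContext named sc; the input path is a hypothetical placeholder:

    // cache() marks the RDD; nothing is materialized until the first action runs
    val lines = sc.textFile("hdfs://.../input.txt")
    val words = lines.flatMap(_.split(" "))

    words.cache()                            // same as persist(StorageLevel.MEMORY_ONLY); nothing cached yet
    val total    = words.count()             // first action: computes the RDD and materializes it in memory
    val distinct = words.distinct().count()  // later actions reuse the cached partitions instead of re-reading the file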

2. RDD serialization: when serialized, each partition is stored as a single byte array, which greatly reduces the number of objects and the memory footprint.

3. In production, Spark Core typically uses only two storage levels: MEMORY_ONLY (deserialized Java objects) and MEMORY_ONLY_SER (the RDD data kept serialized). The CPU computes far faster than data can be read from memory, and memory reads are far faster than disk reads, so recomputing data may well be faster than reading it back from a disk cache; storing data on disk, or replicating it across multiple stores, is therefore usually unnecessary. That is why production use of Spark Core sticks to these two storage levels.

4. Serialization makes the cache occupy less memory, but serializing and deserializing take time and cost CPU, so the cache level should be chosen according to the available resources and the workload; releasing the cache once it is no longer needed is also good practice.
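
A hedged sketch contrasting the two production storage levels and releasing the cache afterwards, assuming an existing SparkContext sc and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs://.../events")

    events.persist(StorageLevel.MEMORY_ONLY)        // deserialized Java objects: fastest access, largest footprint
    // events.persist(StorageLevel.MEMORY_ONLY_SER) // alternative: one byte array per partition, less memory, more CPU

    val errorCount = events.filter(_.contains("ERROR")).count()  // the action materializes the cache

    events.unpersist()                              // good practice: release the cache when it is no longer needed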

5. Acyclic data-flow systems are inefficient in two cases: iterative algorithms and interactive data-mining tools. What these two cases have in common is reusing a dataset across multiple parallel operations. Iterative algorithms apply similar functions to the data at every step; machine-learning algorithms, for instance, feed the weights produced by one iteration into the next as input. In the MapReduce framework the results are written back to disk after each Reduce, which greatly reduces speed, whereas keeping the data in memory as RDDs can dramatically improve performance.

6. Transformations are in essence a logical chain that chronicles how an RDD evolves; an Action is what actually triggers the computation of those Transformations. Because every Transformation is recorded, each RDD knows how it was derived from its parent RDD, so if an error occurs the computation can easily be replayed from this record.
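
A sketch of lazy transformations and recorded lineage, assuming an existing SparkContext sc; the log path and tab-separated format are hypothetical:

    val lines  = sc.textFile("hdfs://.../app.log")
    val errors = lines.filter(_.startsWith("ERROR"))           // transformation: only recorded, nothing runs
    val pairs  = errors.map(line => (line.split("\t")(0), 1))  // transformation: still only recorded
    println(pairs.toDebugString)                               // prints the lineage recorded so far
    val counts = pairs.reduceByKey(_ + _).collect()            // action: triggers the whole chain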

7. Hadoop's dataset-based MapReduce and Spark's working-set-based RDDs share some common traits: locality-aware scheduling (MapReduce: Partition, Reducer; Spark: Partition, Stage) for optimal task placement and better system performance; automatic, efficient fault tolerance that keeps fault-recovery costs low; and load balancing.

9. RDDs also allow users to partition data by a specified key (a finer-grained control), which is an optional feature. Hash partitioning and range partitioning are currently supported. For example, an application can request that two RDDs be hash-partitioned in the same way, so that records with the same key land in the same partition on the same machine, which speeds up joins between them. Some operations, such as groupByKey, reduceByKey and sort, automatically produce a hash- or range-partitioned RDD; choosing the partitioning strategy is an optimization point.
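
A sketch of co-partitioning two pair RDDs with the same hash partitioner so that records with the same key end up in the same partition, speeding up the join; it assumes an existing SparkContext sc, and the data is purely illustrative:

    import org.apache.spark.HashPartitioner

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (3, 12.00)))

    val partitioner = new HashPartitioner(4)
    val usersP  = users.partitionBy(partitioner).persist()
    val ordersP = orders.partitionBy(partitioner).persist()

    usersP.join(ordersP).collect().foreach(println)  // joining co-partitioned RDDs avoids re-shuffling both sides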

10. JVM out-of-memory (OOM) errors:

11. Data-flow models such as MapReduce have their own fault-tolerance characteristics; the RDD's fault-tolerance feature is that it provides a highly restricted form of shared memory: an RDD is read-only and can only be created through deterministic transformations (such as map, join and groupBy) applied to other RDDs.

Two, RDD resilience:

1. Resilience aspect one: automatic switching of data storage between memory and disk;

2. Resilience aspect two: efficient Lineage-based fault tolerance (if node n fails, recovery resumes from node n-1 in the lineage). RDDs support only coarse-grained transformations, i.e. a single operation applied to a large number of records. The series of transformations (such as map, join and groupBy) that created an RDD is recorded as its Lineage. If the Lineage chain grows very long, it can be persisted to physical storage and an RDD checkpoint performed periodically.

3. Resilience aspect three: if a Task fails, it is automatically retried a certain number of times (4 by default);

4. Resilience aspect four: if a Stage fails, it is automatically retried a certain number of times, and only the failed stage is recomputed, i.e. only the failed data partitions are recalculated;

5. checkpoint and persist: every operation on an RDD produces a new RDD, so when the chain becomes long and recomputation is expensive, checkpoint writes the data out to disk; persist reuses the data from memory or disk (see the sketch after this list);

6. Elastic data scheduling: DAG and Task scheduling are decoupled from resource management;

7. Highly elastic data partitioning (the partitioning function can be set manually), e.g. via repartition.
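
A sketch contrasting persist and checkpoint from item 5 above, assuming an existing SparkContext sc; the checkpoint directory is a hypothetical placeholder:

    sc.setCheckpointDir("hdfs://.../checkpoints")

    var rdd = sc.parallelize(1 to 1000)
    for (_ <- 1 to 100) {          // an iterative job keeps extending the lineage chain
      rdd = rdd.map(_ + 1)
    }

    rdd.persist()                  // keep the data around in memory for reuse
    rdd.checkpoint()               // truncate the long lineage by writing the data to reliable storage
    rdd.count()                    // the first action triggers both the caching and the checkpoint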

Three, RDD Transformations and Actions:

1. Transformations: an RDD can only be created through deterministic operations on either data in stable physical storage or other existing RDDs. These deterministic operations are called transformations (Transformation):
  • map(func): returns a new distributed dataset formed by passing each element of the source through the function func
  • filter(func): returns a new dataset formed by selecting those elements of the source on which func returns true
  • flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element)
  • sample(withReplacement, frac, seed): randomly samples a fraction frac of the data, with or without replacement, using the given random seed
  • union(otherDataset): returns a new dataset that is the union of the source dataset and the argument
  • groupByKey([numTasks]): called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. Note: by default 8 parallel tasks are used for grouping; an optional numTasks argument can set a different number of tasks according to the data volume
  • reduceByKey(func, [numTasks]): called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. As with groupByKey, the number of reduce tasks can be configured via an optional second argument
  • join(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
  • groupWith(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is known as CoGroup
  • cartesian(otherDataset): Cartesian product; when called on datasets of types T and U, returns a dataset of (T, U) pairs covering all combinations of elements
2. Actions: start a computation and return a value to the user program or export data to a storage system (a combined sketch of several of these operators follows this list):
    • reduce(func): aggregates all elements of the dataset using the function func, which takes two arguments and returns one value. The function must be associative so that it can be executed correctly in parallel
    • collect(): returns all elements of the dataset to the Driver program as an array. This is usually used after filter or another operation that returns a small enough subset of the data; calling collect on an entire large RDD may cause the Driver program to run out of memory (OOM). It is a bulk read operation that scans the whole dataset, with tasks assigned to the nodes closest to the data
    • count(): returns the number of elements in the dataset; a bulk read operation that scans the whole dataset, with tasks assigned to the nodes closest to the data
    • take(n): returns an array of the first n elements of the dataset. Note that this is currently not executed in parallel on multiple nodes; instead the machine running the Driver program computes all the elements on its own (memory pressure on the gateway machine increases, so use with caution)
    • first(): returns the first element of the dataset (similar to take(1))
    • saveAsTextFile(path): saves the elements of the dataset as a text file to the local file system, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element and writes it as one line of the text file
    • saveAsSequenceFile(path): saves the elements of the dataset in SequenceFile format to a given directory on the local file system, HDFS, or any other Hadoop-supported file system. The RDD elements must be key-value pairs that implement Hadoop's Writable interface or can be implicitly converted to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.)
    • foreach(func): runs the function func on each element of the dataset. This is typically used for side effects such as updating an accumulator variable or interacting with external storage systems
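
A small end-to-end sketch combining several of the operators listed above; it assumes an existing SparkContext sc, and the data is purely illustrative:

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "stage", "rdd", "spark"))

    val counts = words
      .map(w => (w, 1))                 // transformation
      .reduceByKey(_ + _, 2)            // transformation, using 2 reduce tasks
      .filter { case (_, n) => n > 1 }  // transformation

    counts.collect().foreach(println)   // action: collect() returns the results to the driver as an array
    println(counts.count())             // action: count() returns the number of elements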

Figure 1: RDD space

Four, RDD creation sources

1. Creating an RDD from a collection in the program (for small-scale testing)

package com.imf.spark.rdd

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by lujinyong168 on 2016/2/2.
  * DT大数据梦工厂-IMF
  * Creating an RDD from a collection in the program (for small-scale testing)
  */
object RDDCreateByCollections {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()                 // create the SparkConf object
    conf.setAppName("RDDCreateByCollections")  // set the application name
    conf.setMaster("local")
    val sc = new SparkContext(conf)            // create the SparkContext object
    // create a Scala collection
    val numbers = 1 to 100
    val rdd = sc.parallelize(numbers)
    // val rdd = sc.parallelize(numbers, 10)   // set the parallelism to 10
    val sum = rdd.reduce(_ + _)
    println("1+2+3+...+99+100=" + sum)
  }
}

2. Creating an RDD from the local file system (for testing with larger amounts of data)

package com.imf.spark.rdd

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by lujinyong168 on 2016/2/2.
  * DT大数据梦工厂-IMF
  * Creating an RDD from the local file system (for testing with larger amounts of data)
  * Counts the number of characters in a text file
  */
object RDDCreateByLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()            // create the SparkConf object
    conf.setAppName("RDDCreateByLocal")   // set the application name
    conf.setMaster("local")
    val sc = new SparkContext(conf)       // create the SparkContext object
    val rdd = sc.textFile("D://testspark//WordCount.txt")
    val linesLen = rdd.map(line => line.length)
    val sum = linesLen.reduce(_ + _)
    println("The total characters of the file is : " + sum)
  }
}

3. Creating an RDD from HDFS (the most common way to create RDDs in production)

package com.imf.spark.rdd

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by lujinyong168 on 2016/2/2.
  * DT大数据梦工厂-IMF
  * Creating an RDD from HDFS (the most common way to create RDDs in production)
  */
object RDDCreateByHDFS {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()            // create the SparkConf object
    conf.setAppName("RDDCreateByHDFS")    // set the application name
    conf.setMaster("local")
    val sc = new SparkContext(conf)       // create the SparkContext object
    val rdd = sc.textFile("/library/")
    val linesLen = rdd.map(line => line.length)
    val sum = linesLen.reduce(_ + _)
    println("The total characters of the file is : " + sum)
  }
}

4. Creating an RDD from a database (DB); a hedged sketch follows this list

5. Creating an RDD from a NoSQL store, e.g. HBase

6. Creating an RDD from S3

7. Creating an RDD from a data stream
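
For item 4 above, a hedged sketch using Spark's JdbcRDD; the JDBC URL, credentials, table and column names are hypothetical placeholders, and an existing SparkContext sc is assumed:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val jdbcRdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/test", "user", "password"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",  // the two '?' bind the partition bounds
      1L, 1000L, 3,                                            // lowerBound, upperBound, numPartitions
      rs => (rs.getInt(1), rs.getString(2))                    // map each JDBC result row to a pair
    )
    println(jdbcRdd.count())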

Five, an RDD example: scanning log files (terabytes in size) in the Hadoop file system (HDFS) to find the cause of errors on a large website

 

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(_.startsWith("ERROR"))
    errors.cache()

Note: line 1 defines an RDD from an HDFS file (i.e. a collection of text lines); line 2 derives a filtered RDD; line 3 requests that errors be cached. The original RDD lines is not cached, because the error messages may make up only a small fraction of the original dataset (small enough to fit in memory).

errors.count()

Note: up to this point the cluster has not executed any work; the user can now invoke the corresponding action on this RDD.

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
    .map(_.split('\t')(3))
    .collect()

Note: performing further transformations on the RDD.

Six, RDDs vs. distributed shared memory

1. Distributed shared memory (DSM): this refers not only to traditional shared-memory systems, but also to systems that share data through distributed hash tables or distributed file systems.

2. Comparing RDDs with distributed shared memory (DSM):

  Point 1: with DSM, applications can read and write arbitrary locations in a global address space. RDDs, by contrast, can only be created ("written") through coarse-grained batch transformations, although reads of an RDD can still be fine-grained.

  Point 2: DSM makes it hard to run backup copies of tasks, because a task and its backup would read and write the same memory locations. With RDDs, backup copies of slow tasks can be run, so stragglers (nodes that run very slowly) can be handled.

  Point 3: compared with DSM, the RDD model has two advantages. First, for bulk operations on an RDD, the runtime schedules tasks according to where the data is stored, improving performance. Second, for scan-based operations, if there is not enough memory to cache the whole RDD, it can be partially cached: partitions that do not fit in memory are stored on disk, in which case performance is comparable to existing data-flow systems.

Table 1: RDDs vs. distributed shared memory

Aspect                      | RDD                                          | Distributed shared memory (DSM)
Reads                       | Bulk or fine-grained                         | Fine-grained
Writes                      | Bulk (coarse-grained) transformations        | Fine-grained
Consistency                 | Trivial (RDDs are immutable)                 | Up to the application / runtime
Fault tolerance             | Fine-grained and low-overhead (via Lineage)  | Requires checkpoints and program rollback
Straggler mitigation        | Possible via backup tasks                    | Difficult
Work placement              | Automatic, based on data locality            | Up to the application (runtimes aim for transparency)
Behavior if not enough RAM  | Similar to existing data-flow systems        | Poor performance

 

 


Source: www.cnblogs.com/yinminbo/p/11832919.html