Big Data Full-Stack Crash Course -- Spark 2.0 (full project practice scenarios)

About Spark

     Spark integrates with YARN and can directly access data stored in HDFS and HBase, so it combines naturally with Hadoop. Configuration is straightforward.

     Spark is a rapid development framework that is more flexible and practical than Hadoop MapReduce. It lowers processing latency, improves performance, and can be combined effectively with Hadoop.

     Spark is built around the RDD as its core abstraction. Core components such as Spark SQL, Spark Streaming, MLlib, GraphX and SparkR address a wide range of big data problems, and the completeness of the framework has made it more popular by the day. Its surrounding ecosystem, including visualization tools such as Zeppelin, keeps growing, and large companies are racing to replace the corresponding Hadoop modules with Spark. Unlike Hadoop MapReduce, which spills intermediate results to disk between reads and writes, Spark keeps data in memory wherever possible, which is why it is fast. In addition, its DAG-based job scheduling, built on wide and narrow dependencies, makes Spark faster still.

 


 

Spark core components

1、RDD

     An RDD (Resilient Distributed Dataset) is elastic: if part of the data is lost, it can be rebuilt. RDDs offer automatic fault tolerance, location-aware scheduling and scalability; fault tolerance is achieved through checkpoints and by recording the lineage of data updates, which allows errors to be detected and data to be recovered. A file is loaded into an RDD with SparkContext.textFile(), new RDDs are then derived through transformations, and an action finally writes the RDD out to an external system.
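
     To make this load / transform / action cycle concrete, here is a minimal Scala sketch; the application name, input path and output path are placeholders, not taken from the article.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Load a text file into an RDD (placeholder path).
        val lines = sc.textFile("hdfs:///tmp/input.txt")

        // Transformations only build new RDDs; nothing executes yet.
        val longLines = lines.filter(_.length > 10)

        // An action materializes the result and writes it to an external system.
        longLines.saveAsTextFile("hdfs:///tmp/output")

        sc.stop()
      }
    }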

     RDDs are lazily evaluated: data is loaded only when it is actually needed. Materializing every intermediate result would waste storage, hence the lazy evaluation. Once Spark sees the whole chain of transformations, it can compute only the data required for the final result; data that no later function needs is never loaded. RDD transformations are therefore lazy and are executed only when an action is applied.

     Spark is split into a driver and executors. The driver, which corresponds to the SparkContext, submits jobs; executors are processes started on the worker nodes for the application and run the tasks. Operations on RDDs are either transformations or actions. A transformation records the dependency between RDDs, so the DAG of RDD dependencies is built up and stored; besides explicit backups, a lost partition can also be recomputed from this dependency metadata after a worker node goes down. When a job is submitted, runJob is called: Spark builds the DAG from the RDD dependencies and hands it to the DAGScheduler, which is created together with the SparkContext and is responsible for job scheduling. Once the dependency graph is built, Spark parses it backwards from the action, turning each operation into a task; every time a shuffle is encountered the graph is cut into a new taskSet (stage) and the shuffle output is written to disk, while data that does not cross a shuffle stays in memory. Parsing continues backwards until no operators remain, and only then does execution start from the front. If there is no action, nothing runs at all; execution begins only when an action is reached, which is what makes Spark lazy. Each taskSet is submitted to the TaskScheduler, which wraps it in a TaskSetManager and hands it to the executors to run. When a taskSet finishes, the DAGScheduler is notified and submits the next one; when a taskSet fails, it is returned to the DAGScheduler and resubmitted. A job may contain several taskSets, and an application may contain several jobs.
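
     A small sketch of the laziness and stage-splitting described above; nothing executes until the count action, and toDebugString only prints the RDD lineage. The data and names are made up for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyDagDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lazy-dag").setMaster("local[*]"))

        val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

        // map has a narrow dependency, reduceByKey needs a shuffle,
        // so the DAGScheduler cuts the job into two taskSets (stages) at that point.
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // Still nothing has executed; we can only inspect the dependency DAG.
        println(counts.toDebugString)

        // The action triggers runJob: stages are submitted one after another.
        println(counts.count())

        sc.stop()
      }
    }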

 


 

2、Spark Streaming 

     Spark Streaming reads data, for example from Kafka, and cuts the stream into small time slices (a few seconds), processing each slice in a batch-like way. Every time slice produces an RDD, which gives high fault tolerance, and because the small batches are compatible with batch-oriented logic, real-time data can be processed with the same algorithms as batch data and even analyzed jointly with historical data, for example in classification algorithms. Map/reduce, join and other operations can also be applied to these small batches while still keeping near real-time behavior. It suits engineering problems whose latency requirements on the stream are not at the millisecond level.
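
     A minimal sketch of this micro-batch model with a 5-second batch interval, assuming a plain socket source on localhost:9999 instead of Kafka:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
        // Every 5-second slice of the stream becomes one RDD inside a DStream.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Hypothetical source: a text stream on localhost:9999 (Kafka would work similarly).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same batch-style logic (flatMap / map / reduceByKey) is applied to each slice.
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }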

     Spark Streaming likewise has a StreamingContext. Its core abstraction is the DStream, a structure keyed by Time with an RDD as the value: each RDD holds the data of one time interval, and the stream is the time-ordered sequence of these RDDs, which can be persisted with persist. Incoming data is buffered in a queue maintained by a BlockGenerator; once all the data for an interval has arrived, it is assembled into one RDD (the data of that interval). Job submission is similar to core Spark, except that the RDD is obtained from inside the DStream at submission time: triggering an action on that RDD produces a Job that is submitted to the JobQueue in the JobManager and scheduled by the JobScheduler, which passes it on to Spark's own scheduler, where it is turned into a large number of tasks distributed to the cluster for execution. Job generation starts from the output stream and triggers execution backwards through the DStream DAG. Handling node failures during stream processing is usually more complex than in offline processing. Since 1.3, Spark Streaming can periodically checkpoint the DStream to HDFS and store the offsets along with it, avoiding writes to ZooKeeper. If the driver node fails, data is recovered from the previous checkpoint. If a worker node fails and the input is a file source such as HDFS, the data is recomputed from the dependency lineage; if the input is a network source such as Kafka or Flume, Spark replicates the received data to other nodes in the cluster, so that when a worker fails the system can recompute from the copy held elsewhere. However, if the receiving node itself fails, the part of the data that has been received but not yet replicated is lost, and the receiver thread is restarted on another node to continue receiving data.
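
     A hedged sketch of the checkpoint-based recovery described above, using StreamingContext.getOrCreate; the checkpoint directory and socket source are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedStream {
      // Placeholder HDFS directory where DStream metadata (and offsets) are checkpointed.
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpointed-stream").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint(checkpointDir)
        ssc.socketTextStream("localhost", 9999).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a fresh start this builds a new context; after a driver failure it is
        // reconstructed from the data previously written to the checkpoint directory.
        val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
        ssc.start()
        ssc.awaitTermination()
      }
    }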

 


 

3、GraphX

     GraphX is intended mainly for graph computation. Its core algorithms include PageRank, SVD (singular value decomposition), TriangleCount and so on.
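
     A small PageRank sketch on a toy three-page graph; the vertices, edges and tolerance value are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object PageRankDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pagerank-demo").setMaster("local[*]"))

        // Toy graph: three pages linking to each other in a cycle.
        val vertices = sc.parallelize(Seq((1L, "pageA"), (2L, "pageB"), (3L, "pageC")))
        val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
        val graph = Graph(vertices, edges)

        // Run PageRank until the ranks change by less than the given tolerance.
        val ranks = graph.pageRank(0.0001).vertices
        ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }

        sc.stop()
      }
    }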

 


 

4、Spark SQL

    Spark SQL is Spark's interactive big data SQL technology. It translates SQL statements into RDD operations on Spark and supports data sources such as Hive and JSON.
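
    A minimal Spark 2.0-style sketch: a SparkSession reads a JSON file (placeholder path) and runs a SQL query over it; Hive support could be added with enableHiveSupport() if the Hive dependencies are present.

    import org.apache.spark.sql.SparkSession

    object SqlDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sql-demo")
          .master("local[*]")
          .getOrCreate()

        // Placeholder path; each JSON line becomes a row with an inferred schema.
        val people = spark.read.json("hdfs:///tmp/people.json")
        people.createOrReplaceTempView("people")

        // The SQL is translated into operations on Spark's distributed datasets.
        spark.sql("SELECT name, age FROM people WHERE age > 30").show()

        spark.stop()
      }
    }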

 


 

5、Spark R

     SparkR lets you call Spark from the R language. It does not yet have as broad an API as Scala or Java; Spark exposes its API through the RDD class and lets users run tasks on the cluster interactively from R. It also integrates the MLlib machine learning library.

 


 

6、MLBase

     From top to bottom, MLBase consists of the ML Optimizer (for end users), MLI (for algorithm users), MLlib (for algorithm developers) and Spark; MLlib can also be used directly on its own. The ML Optimizer is a module that selects a more suitable machine learning algorithm and its parameters; MLI is an API platform for feature extraction and high-level ML programming abstractions and algorithm implementation; MLlib is the distributed machine learning library, which can be extended with new algorithms over time. MLRuntime is built on the Spark computing framework and applies Spark's distributed computation to machine learning. MLBase provides a simple declarative way to specify a machine learning task and dynamically chooses the best learning algorithm for it.
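
     As a concrete taste of the MLlib layer mentioned above, a hedged k-means sketch on made-up 2-D points:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-demo").setMaster("local[*]"))

        // Made-up 2-D points forming two rough clusters.
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
          Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

        // Train k-means with k = 2 clusters and at most 20 iterations.
        val model = KMeans.train(points, 2, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }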

 


 

7、Tachyon

      Tachyon is a highly fault-tolerant distributed file system that claims performance more than 3000 times that of HDFS. It has a Java-like interface and also implements the HDFS interface, so Spark and MapReduce programs can run on it without any modification. It currently supports HDFS, S3 and other storage systems underneath.


 

8、Spark operators

1、map: processes each element of the source data, similar to a traversal, producing a MappedRDD; the original partitioning is unchanged.

2、flatMap: transforms every element of the original RDD into new elements through a function and merges the elements of each resulting collection into one collection. For example, if an element contains several lists, flatMap merges them all into one big list. The classic case is word count, where each line is split into individual words: line.flatMap(_.split(" ")).map((_, 1)); with map alone, the words of one line would instead end up as a single list (a full word-count sketch follows this list).

3、mapPartitions: iterates over each partition as a whole, producing a MapPartitionsRDD.

4、union: merges two RDDs into one. Both RDDs must have the same element type, and the returned RDD has the same element type as the inputs.

5、filter: filters elements by calling a function f on each one; elements for which f returns true are kept in the RDD.

6、distinct: removes duplicate elements from the RDD.

7、subtract: removes from RDD1 all elements that lie in the intersection of RDD1 and RDD2, i.e. returns the elements of RDD1 that are not in RDD2.

8、sample: samples the elements of the RDD. The first parameter, withReplacement, is true for sampling with replacement and false for sampling without; the second parameter is the fraction, and the third is the random seed, e.g. data.sample(true, 0.3, new Random().nextInt()).

9、takeSample: used like sample, except that the second parameter is a count rather than a fraction, and the result is not an RDD but a collected array, as with collect.

10、cache: caches the RDD in memory; equivalent to persist(MEMORY_ONLY). The ratio between cache memory and execution memory can be configured; if the data is larger than the cache memory, cached blocks are dropped.

11、persist: takes a storage level such as DISK_ONLY, MEMORY_ONLY or MEMORY_AND_DISK; with MEMORY_AND_DISK, data automatically spills to disk once the cache is full.

12、mapValues: for key-value data, applies a map function to the values only, leaving the keys untouched.

13、reduceByKey: for key-value data, aggregates the values of the same key. Unlike groupByKey, it performs a combine step similar to the combiner in MapReduce, which reduces data IO and improves efficiency. For operations that are not simply additive, the values can be packed into strings or another format so that all values of the same key end up together, and the combined data can then be unpacked again by iterating over it.

14、partitionBy: repartitions the RDD, producing a ShuffledRDD through a shuffle; this can speed things up when many shuffle operations would otherwise follow.

15、randomSplit: randomly splits an RDD, e.g. data.randomSplit(new double[]{0.7, 0.3}) returns an array of RDDs.

16、cogroup: for the key-value elements of two RDDs, gathers the elements with the same key in each RDD into separate collections. Unlike reduceByKey, it combines elements with the same key across two RDDs.

17、join: equivalent to an inner join. It cogroups the two RDDs and then takes the Cartesian product of the lists under each key, outputting every matching pair as the value; equivalent to where a.key = b.key in SQL.

18、leftOuterJoin, rightOuterJoin: as in a database, a left join lists all rows of the left table and fills the missing right side with null. Here, on top of join, keys of the left RDD with no match on the right are padded with an empty value; the right join works the other way around.

19、saveAsTextFile: writes the data to the specified directory on HDFS.

20、saveAsObjectFile: writes to HDFS in SequenceFile format.

21、collect, collectAsMap: turn the RDD into a list or a map; the result is returned as a List or a HashMap.

22、count: counts the elements of the RDD and returns the number.

23、top(k): returns the k largest elements as a List.

24、take: returns the first k elements of the data.

25、takeOrdered: returns the k smallest elements, keeping them ordered in the result.
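
The word-count sketch referred to in the flatMap entry, stringing several of the operators above together; the paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object OperatorDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("operators").setMaster("local[*]"))

        val lines = sc.textFile("hdfs:///tmp/input.txt")   // placeholder path

        val counts = lines
          .flatMap(_.split(" "))    // 2: split each line into words and flatten
          .filter(_.nonEmpty)       // 5: drop empty tokens
          .map((_, 1))              // 1: build (word, 1) pairs
          .reduceByKey(_ + _)       // 13: sum the counts per word, with a map-side combine

        println(counts.count())                        // 22: number of distinct words
        counts.map(_.swap).top(5).foreach(println)     // 23: five most frequent words
        counts.saveAsTextFile("hdfs:///tmp/wordcount") // 19: write the result to HDFS

        sc.stop()
      }
    }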


 

9、Tips

1、RDD.repartition(n) can be used to repartition an RDD at the very beginning. The operation is actually a shuffle and may be fairly expensive, but if many actions follow, it shortens the time of the later operations (see the sketch after this list). The value of n depends on the number of CPU cores; it is usually more than twice the number of cores and less than 1000.

2、Do not use too many actions. Each action turns the preceding taskSets into a separate job, so as jobs accumulate and their tasks are not released, more memory is held and GC drags down performance.

3、Filter before a shuffle to reduce the amount of shuffled data, and drop null and empty values while doing so.

4、Replace groupByKey with reduceByKey wherever possible. reduceByKey does a local reduce on each worker node before the global reduce, which amounts to a combine step as in Hadoop, with the combine logic identical to the reduce logic; groupByKey cannot guarantee this.

5、When joining, prefer joining a small RDD against a large RDD, and a large RDD against an extra-large RDD.

6、Avoid collect. On a very large dataset, collect gathers data from every worker, which increases IO and drags down performance; when the dataset is large, save it to HDFS instead.

7、If an RDD is used again in later iterations, cache it, but estimate the data size carefully so it does not exceed the memory set aside for caching; if it does, previously stored cache blocks are dropped, which may lead to errors in the computation. For fully reliable storage use persist(MEMORY_AND_DISK), since cache is just persist(MEMORY_ONLY).

8、Set spark.cleaner.ttl to clean up tasks periodically; because of how jobs run, many already-executed tasks may stay cached, so periodic cleanup avoids a concentrated GC that drags down performance.

9、Pre-partition where appropriate, configured through partitionBy(); each partiti…
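
A short sketch applying several of these tips together (repartition up front, filtering before the shuffle, reduceByKey instead of groupByKey, persisting with MEMORY_AND_DISK, and saving instead of collecting); the paths and partition count are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object TipsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tips").setMaster("local[*]"))

        val raw = sc.textFile("hdfs:///tmp/events.txt")   // placeholder path
          .repartition(8)                                 // tip 1: roughly 2x the CPU cores

        val pairs = raw
          .filter(line => line != null && line.nonEmpty)  // tip 3: filter before the shuffle
          .map(line => (line.split(",")(0), 1))

        val counts = pairs.reduceByKey(_ + _)             // tip 4: reduceByKey over groupByKey
          .persist(StorageLevel.MEMORY_AND_DISK)          // tip 7: spill to disk if memory is short

        println(counts.count())                           // reuse the cached RDD across actions
        counts.saveAsTextFile("hdfs:///tmp/event-counts") // tip 6: save instead of collect

        sc.stop()
      }
    }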

