News Real-Time Analysis System: Spark 2.X Resilient Distributed Datasets

1. Introduction to the three resilient datasets

1) Concepts

 

2) Comparison of advantages and disadvantages

 

 

 

2. Spark RDD overview and creation methods

1) Overview

Behind a Spark cluster sits a very important distributed data abstraction: the resilient distributed dataset (RDD). An RDD is a logically centralized entity whose data is partitioned across multiple machines in the cluster. The RDD is Spark's core data structure. The dependencies between RDDs determine the order in which Spark schedules work, and a Spark program as a whole is made up of operations on RDDs.
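To make the partitioning and dependency ideas concrete, here is a minimal sketch. It assumes a local SparkSession (`local[2]`) and a hypothetical object name `RddOverviewSketch`; neither comes from the original post.

```scala
import org.apache.spark.sql.SparkSession

object RddOverviewSketch {
  // Returns (number of partitions, lineage description) for a small derived RDD.
  def run(): (Int, String) = {
    // Assumption: a local session is enough for illustration.
    val spark = SparkSession.builder().master("local[2]").appName("RddOverviewSketch").getOrCreate()
    try {
      val sc = spark.sparkContext
      // One logical dataset, explicitly split into 4 partitions.
      val numbers = sc.parallelize(1 to 100, numSlices = 4)
      // Each transformation yields a new RDD that depends on its parent.
      val evens = numbers.filter(_ % 2 == 0).map(_ * 10)
      // toDebugString describes the chain of dependencies (the lineage).
      (evens.getNumPartitions, evens.toDebugString)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (partitions, lineage) = run()
    println(s"partitions = $partitions")
    println(lineage)
  }
}
```

The lineage string lists the parent RDDs that Spark would recompute from if a partition were lost.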

2) Creation methods

a) First method: parallelize an existing collection

val data = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)

b) Second method: load an external file

scala> val distFile = sc.textFile("data.txt")

distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
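The two snippets above run in spark-shell, where `sc` already exists. As a standalone sketch, the same two creation methods could look like the following. The local session, the temporary file standing in for data.txt, and the object name `RddCreationSketch` are all assumptions.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object RddCreationSketch {
  // Returns (sum of the parallelized numbers, number of lines read from the file).
  def run(): (Int, Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("RddCreationSketch").getOrCreate()
    try {
      val sc = spark.sparkContext

      // Method a: distribute an in-memory collection.
      val distData = sc.parallelize(Array(1, 2, 3, 4, 5))

      // Method b: read a text file; a temp file stands in for data.txt here.
      val path = Files.createTempFile("data", ".txt")
      Files.write(path, "first line\nsecond line\nthird line\n".getBytes(StandardCharsets.UTF_8))
      val distFile = sc.textFile(path.toUri.toString)

      (distData.reduce(_ + _), distFile.count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```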

3. The five properties of a Spark RDD
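The RDD class documentation describes five properties: a list of partitions, a function for computing each partition, a list of dependencies on other RDDs, an optional partitioner for key-value RDDs, and an optional list of preferred locations per partition. The sketch below inspects four of them on a small example; the data, the local session, and the object name `RddPropertiesSketch` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object RddPropertiesSketch {
  // Returns (partition count, dependency count, partitioner defined?, preferred locations).
  def run(): (Int, Int, Boolean, Seq[String]) = {
    val spark = SparkSession.builder().master("local[2]").appName("RddPropertiesSketch").getOrCreate()
    try {
      val sc = spark.sparkContext
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), 2)
      val summed = pairs.reduceByKey(_ + _, 3) // shuffles into 3 hash partitions

      // 1. A list of partitions.
      val partitionCount = summed.partitions.length
      // 2. A compute function per partition: Spark calls it when a task runs,
      //    so it is not invoked directly here.
      // 3. A list of dependencies on parent RDDs (one shuffle dependency here).
      val dependencyCount = summed.dependencies.length
      // 4. An optional partitioner for key-value RDDs.
      val hasPartitioner = summed.partitioner.isDefined
      // 5. Optional preferred locations; an in-memory collection has none.
      val locations = pairs.preferredLocations(pairs.partitions.head)

      (partitionCount, dependencyCount, hasPartitioner, locations)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```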

 

4. Spark RDD operations

1) RDDs are evaluated lazily: nothing is actually computed until an action is invoked.
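One way to observe the laziness is to count how many elements a `map` function has touched before and after an action runs. This is a minimal sketch that uses a long accumulator for the counting; the local session and the object name `LazyEvaluationSketch` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns (elements touched before the action, result of the action, elements touched after).
  def run(): (Long, Int, Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("LazyEvaluationSketch").getOrCreate()
    try {
      val sc = spark.sparkContext
      val touched = sc.longAccumulator("touched")

      // Transformation only: Spark records the lineage but runs nothing yet.
      val doubled = sc.parallelize(1 to 10).map { x => touched.add(1); x * 2 }
      val before = touched.sum

      // Action: this triggers the job, so the map function finally executes.
      val total = doubled.reduce(_ + _)
      val after = touched.sum

      (before, total, after)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In a local run the counter stays at 0 until `reduce` is called, then reflects all ten elements.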

 

2) The three kinds of RDD operations

 

a) Transformation functions

 

b) Action functions

 

c) Usage in practice
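As an illustration of transformations and actions working together, here is a small word-count sketch. The input lines, the local session, and the object name `RddOperationsSketch` are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object RddOperationsSketch {
  // Returns (word -> count map, number of distinct words).
  def run(): (Map[String, Int], Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("RddOperationsSketch").getOrCreate()
    try {
      val sc = spark.sparkContext
      val lines = sc.parallelize(Seq("spark rdd", "spark dataframe", "spark dataset"))

      // Transformations: each returns a new RDD and runs nothing by itself.
      val counts = lines
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Actions: collect and count trigger the actual computation.
      (counts.collect().toMap, counts.count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```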

 

5. DataFrame: creation methods and functions

1) What is a DataFrame

 

2) DataFrame compared with RDD

 

3) DataFrame compared with DataSet

 

4) Creation method: converting an RDD to a DataFrame
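A common way to do this conversion is to hold case-class records in the RDD and call `toDF()` after importing the session's implicits. This is a minimal sketch; the record type `NewsClick`, the sample data, and the object name `RddToDataFrameSketch` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; it must be defined outside the method so Spark can derive an encoder.
case class NewsClick(topic: String, clicks: Int)

object RddToDataFrameSketch {
  // Returns (column names, number of rows with more than 4 clicks).
  def run(): (Seq[String], Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("RddToDataFrameSketch").getOrCreate()
    import spark.implicits._ // brings toDF() and the $"col" syntax into scope
    try {
      val rdd = spark.sparkContext.parallelize(Seq(NewsClick("sports", 3), NewsClick("tech", 5)))
      // Column names and types are inferred from the case class fields.
      val df = rdd.toDF()
      (df.columns.toSeq, df.filter($"clicks" > 4).count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```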

 

5) Creation method: converting a DataSet to a DataFrame
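In Spark 2.X a DataFrame is an alias for `Dataset[Row]`, so the conversion is a single `toDF()` call, optionally with new column names. This is a minimal sketch; the record type `Article`, the sample data, and the object name `DatasetToDataFrameSketch` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example.
case class Article(title: String, views: Int)

object DatasetToDataFrameSketch {
  // Returns (columns kept from the case class, columns after renaming).
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder().master("local[2]").appName("DatasetToDataFrameSketch").getOrCreate()
    import spark.implicits._
    try {
      val ds = Seq(Article("Spark 2.X released", 120), Article("RDD basics", 45)).toDS()

      // toDF() drops the static type and keeps the field names as columns.
      val df = ds.toDF()
      // toDF(names...) renames the columns during the conversion.
      val renamed = ds.toDF("headline", "hits")

      (df.columns.toSeq, renamed.columns.toSeq)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```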

 

6. DataSet: creation methods and functions

Ways to create a DataSet
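Three common ways to build a Dataset are sketched below: from a local collection with `toDS()`, from an RDD with `createDataset`, and from a DataFrame with `as[T]`. The record type `Reader`, the sample data, and the object name `DatasetCreationSketch` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example.
case class Reader(name: String, age: Long)

object DatasetCreationSketch {
  // Returns the sorted names found across the three Datasets.
  def run(): Seq[String] = {
    val spark = SparkSession.builder().master("local[2]").appName("DatasetCreationSketch").getOrCreate()
    import spark.implicits._
    try {
      // 1. From a local collection.
      val fromSeq = Seq(Reader("ann", 30L)).toDS()
      // 2. From an existing RDD.
      val fromRdd = spark.createDataset(spark.sparkContext.parallelize(Seq(Reader("bob", 25L))))
      // 3. From a DataFrame whose column names and types match the case class.
      val fromDf = Seq(("cid", 41L)).toDF("name", "age").as[Reader]

      fromSeq.union(fromRdd).union(fromDf).map(_.name).collect().sorted.toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```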

 

7. Spark 2.X source code analysis

Download the Spark 2.2 source package, extract it, and then import it into IntelliJ IDEA.

8. Comparison and conversion between datasets

1) Operating on data with RDD and DataSet

 

 

 

 

 

 

 

 

2) Conversion operations

DataFrame / Dataset to RDD
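Both directions back to an RDD go through the `rdd` member: a Dataset yields an RDD of its element type, while a DataFrame yields an RDD of `Row`. This is a minimal sketch; the record type `Visit`, the sample data, and the object name `ToRddSketch` are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical record type for the example.
case class Visit(page: String, hits: Int)

object ToRddSketch {
  // Returns (sorted page names read from the Row RDD, total hits from the typed RDD).
  def run(): (Seq[String], Int) = {
    val spark = SparkSession.builder().master("local[2]").appName("ToRddSketch").getOrCreate()
    import spark.implicits._
    try {
      val ds = Seq(Visit("home", 2), Visit("news", 7)).toDS()

      // Dataset -> RDD keeps the element type.
      val typedRdd: RDD[Visit] = ds.rdd
      // DataFrame -> RDD yields generic Row objects; fields are read by name.
      val rowRdd: RDD[Row] = ds.toDF().rdd

      val pages = rowRdd.map(_.getAs[String]("page")).collect().sorted.toSeq
      val totalHits = typedRdd.map(_.hits).reduce(_ + _)
      (pages, totalHits)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```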

 

Grouping and sorting
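Assuming this step refers to grouping records by a key and then ordering the grouped result, a DataFrame version could look like the sketch below. The column names, the sample data, and the object name `GroupSortSketch` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

object GroupSortSketch {
  // Returns (topic, total clicks) pairs ordered by total clicks, highest first.
  def run(): Seq[(String, Long)] = {
    val spark = SparkSession.builder().master("local[2]").appName("GroupSortSketch").getOrCreate()
    import spark.implicits._
    try {
      val df = Seq(("sports", 3), ("tech", 5), ("sports", 4), ("world", 1)).toDF("topic", "clicks")

      // Group by topic, sum the clicks per group, then sort the groups.
      val ranked = df
        .groupBy("topic")
        .agg(sum("clicks").as("total"))
        .orderBy(desc("total"))

      // Summing an integer column produces a long column.
      ranked.collect().map(row => (row.getString(0), row.getLong(1))).toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```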

 

 


Origin www.cnblogs.com/misliu/p/11482391.html