1. Introduction to the three resilient data sets (RDD, DataFrame, DataSet)
1) Concepts
2) Comparison of advantages and disadvantages
2. Spark RDD overview and ways to create one
1) Overview
Behind a Spark cluster sits a very important distributed data abstraction, the resilient distributed dataset (RDD). An RDD is a logically centralized entity whose data is partitioned across multiple machines in the cluster. The RDD is Spark's core data structure: Spark schedules work in an order determined by the dependencies between RDDs, and a whole Spark program is expressed as operations on RDDs.
2) Ways to create an RDD
a) First way: parallelize an existing collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
b) Second way: read an external data set
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
3. The five properties of a Spark RDD
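As a rough illustration, the five properties that the Spark RDD API documentation lists (partitions, a compute function, dependencies, an optional partitioner, optional preferred locations) can be inspected on a live RDD. This is a minimal sketch, assuming a local session; the object name RddPropertiesExample, the sample data, and the local[2] master are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RddPropertiesExample {
  // Returns (number of partitions, number of parent dependencies, partitioner defined?).
  def inspect(): (Int, Int, Boolean) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[2]").appName("rdd-properties").getOrCreate()
    val sc = spark.sparkContext
    try {
      // A key-value RDD, then a shuffle into 2 partitions so that a partitioner is set.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)
      val summed = pairs.reduceByKey(_ + _, 2)

      // 1. A list of partitions.
      val partitions = summed.partitions
      // 2. A compute function: RDD.compute(split, context) is invoked by Spark
      //    for each partition when a job runs; it is not called directly here.
      // 3. A list of dependencies on parent RDDs.
      val deps = summed.dependencies
      // 4. Optionally, a partitioner (for key-value RDDs).
      val partitioner = summed.partitioner
      // 5. Optionally, preferred locations for each partition (often empty locally).
      val locations = partitions.map(p => summed.preferredLocations(p))

      println(s"partitions=${partitions.length}, dependencies=$deps, " +
        s"partitioner=$partitioner, preferredLocations=${locations.toSeq}")
      (partitions.length, deps.length, partitioner.isDefined)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(inspect())
}
```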
4. Spark RDD operations
1) RDDs are evaluated lazily: nothing is actually executed until an action is called.
2) Three aspects of RDD operations
a) Transformation functions
b) Action functions
c) Specific usage
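The difference between transformations and actions, and the lazy execution noted above, can be sketched as follows. This is only an illustrative example: the object name LazyEvaluationExample, the accumulator used to count calls, and the local[2] master are assumptions, not part of the original notes.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationExample {
  // Returns (map calls seen before the action, map calls seen after it, sum of results).
  def run(): (Long, Long, Int) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[2]").appName("lazy-rdd").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Accumulator that counts how many times the map function has run.
      val calls = sc.longAccumulator("map calls")
      val data = sc.parallelize(Array(1, 2, 3, 4, 5))

      // Transformation: only records the lineage, no job is started.
      val doubled = data.map { x => calls.add(1); x * 2 }
      val before = calls.value.longValue // nothing has run yet

      // Action: triggers the job, which now evaluates the map above.
      val total = doubled.reduce(_ + _)
      val after = calls.value.longValue

      (before, after, total)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The counter stays at zero after map and only advances once reduce runs, which is the lazy behavior in practice.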
5. DataFrame: ways to create one and its functions
1) What is a DataFrame
2) DataFrame compared with RDD
3) DataFrame compared with DataSet
4) One way to create: converting an RDD to a DataFrame
5) Another way to create: converting a DataSet to a DataFrame
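Both conversions listed above can be sketched with toDF(), which becomes available through spark.implicits. This is a minimal sketch under stated assumptions: the Person case class, the object name ToDataFrameExample, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object ToDataFrameExample {
  // Returns the column names of the two resulting DataFrames.
  def run(): (Seq[String], Seq[String]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[2]").appName("to-dataframe").getOrCreate()
    import spark.implicits._
    try {
      // RDD -> DataFrame: column names are taken from the case class fields.
      val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
      val fromRdd: DataFrame = rdd.toDF()

      // Dataset -> DataFrame: drops the static type and keeps the schema.
      val ds: Dataset[Person] = Seq(Person("Cat", 41)).toDS()
      val fromDs: DataFrame = ds.toDF()

      (fromRdd.columns.toSeq, fromDs.columns.toSeq)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```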
6. DataSet: ways to create one and its functions
Ways to create a DataSet
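A few common ways to create a DataSet are sketched below: from a local collection, from an RDD, and from a DataFrame. The Employee case class, the object name CreateDatasetExample, and the sample values are hypothetical and serve only as an illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Employee(name: String, salary: Double)

object CreateDatasetExample {
  // Returns (row count of the collection-based Dataset, sorted names from the typed Dataset).
  def run(): (Long, Seq[String]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[2]").appName("create-dataset").getOrCreate()
    import spark.implicits._
    try {
      // 1. From a local collection.
      val fromSeq: Dataset[Int] = Seq(1, 2, 3).toDS()

      // 2. From an RDD of case class instances.
      val rdd = spark.sparkContext.parallelize(Seq(Employee("Ann", 100.0), Employee("Bob", 80.0)))
      val fromRdd: Dataset[Employee] = spark.createDataset(rdd)

      // 3. From a DataFrame, by attaching a type with as[T].
      val fromDf: Dataset[Employee] = fromRdd.toDF().as[Employee]

      (fromSeq.count(), fromDf.map(_.name).collect().sorted.toSeq)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```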
7. Spark 2.x source code analysis
Download the Spark2.2-src source package, extract it, and then import it into IDEA.
8. Comparison and conversion between data sets
1) Operating on data with RDD and DataSet
2) Conversion operations
DataFrame / Dataset to RDD
Grouping and sorting
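For the DataFrame / Dataset to RDD conversion listed above, a minimal sketch might look like the following. The Score case class, the object name ToRddExample, and the sample data are hypothetical; the .rdd member is the only conversion call used.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical record type used only for this illustration.
case class Score(name: String, points: Int)

object ToRddExample {
  // Returns (sorted names read from the RDD[Row], total points from the RDD[Score]).
  def run(): (Seq[String], Int) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder().master("local[2]").appName("to-rdd").getOrCreate()
    import spark.implicits._
    try {
      val ds = Seq(Score("Ann", 3), Score("Bob", 5)).toDS()
      val df = ds.toDF()

      // Dataset[T].rdd yields an RDD[T], so fields stay typed.
      val typed: RDD[Score] = ds.rdd
      // DataFrame.rdd yields an RDD[Row]; fields are read by name or position.
      val rows: RDD[Row] = df.rdd

      val names = rows.map(_.getAs[String]("name")).collect().sorted.toSeq
      val total = typed.map(_.points).reduce(_ + _)
      (names, total)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```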