Spark study notes

Spark basics and core concepts

1. What is Spark?
a) A cluster computing framework
b) An extension of the MapReduce model
c) In-memory computing
2. Spark components
a) Spark Core
b) Spark SQL
c) Spark Streaming
d) MLlib (classification, regression, clustering, collaborative filtering)
e) GraphX: parallel graph computation
f) YARN (cluster manager)
g) Mesos (cluster manager)
3. Spark core concepts
a) Driver program
b) RDD
c) SparkContext
d) Maven and sbt packaging (a minimal build.sbt sketch follows this list)
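
Since the notes mention sbt packaging, here is a minimal build.sbt sketch for a Spark project. The project name and the Scala/Spark version numbers are illustrative assumptions, not taken from the notes:

    // build.sbt -- minimal sketch; version numbers are illustrative assumptions
    name := "spark-study-notes"
    version := "0.1.0"
    scalaVersion := "2.12.18"

    // "provided" because the Spark cluster supplies this jar at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"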
RDD programming

1. RDD: resilient distributed dataset
2. Two ways to create an RDD (a fuller sketch follows this list)
a) Load an external dataset
i. e.g.: val lines = sc.textFile("README.md")
b) Parallelize a collection in the driver program
i. e.g.: val lines = sc.parallelize(List("pandas", "I like pandas"))
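
A minimal self-contained sketch of both creation methods; the local[*] master is an assumption for local testing:

    import org.apache.spark.{SparkConf, SparkContext}

    object CreateRDDs {
      def main(args: Array[String]): Unit = {
        // The driver program creates a SparkContext, the entry point to the cluster
        val conf = new SparkConf().setAppName("CreateRDDs").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Way 1: load an external dataset
        val lines = sc.textFile("README.md")

        // Way 2: parallelize an in-memory collection (handy for testing)
        val pandas = sc.parallelize(List("pandas", "I like pandas"))

        println(s"lines: ${lines.count()}, pandas: ${pandas.count()}")
        sc.stop()
      }
    }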
3. How Spark works
a) Create input RDDs from external data
b) Use transformations such as filter() to define new RDDs
c) Call persist() on any RDD that will be reused, keeping it in memory
d) Call an action to trigger the parallel computation; Spark optimizes the computation before executing it (see the sketch below)
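
A minimal sketch of this four-step workflow, assuming a SparkContext named sc and a README.md in the working directory:

    val lines = sc.textFile("README.md")             // a) input RDD from external data
    val errors = lines.filter(_.contains("error"))   // b) transformation (lazy)
    errors.persist()                                 // c) keep it in memory for reuse
    println(errors.count())                          // d) action triggers the computation
    println(errors.take(5).mkString("\n"))           //    second action hits the cached data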
4. RDD operations
a) Transformations (examples follow this list; note that collect() is an action, listed below)
i. map(), e.g. rdd.map(x => x * x)
ii. filter(), e.g. rdd.filter(line => line.contains("error"))
iii. union(), e.g. rdd.union(otherRDD)
iv. flatMap()
v. distinct()
vi. intersection()
vii. subtract()
viii. cartesian()
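
A sketch of the listed transformations, assuming a SparkContext named sc; all are lazy and nothing runs until an action is called:

    val nums = sc.parallelize(List(1, 2, 3, 3))
    val other = sc.parallelize(List(3, 4, 5))

    nums.map(x => x * x)           // 1, 4, 9, 9
    nums.flatMap(x => List(x, x))  // 1, 1, 2, 2, 3, 3, 3, 3
    nums.distinct()                // 1, 2, 3
    nums.union(other)              // 1, 2, 3, 3, 3, 4, 5
    nums.intersection(other)       // 3
    nums.subtract(other)           // 1, 2
    nums.cartesian(other)          // (1,3), (1,4), ... every pair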
b) Actions (examples follow this list)
i. count()
ii. take()
iii. saveAsTextFile()
iv. reduce()
v. fold()
vi. aggregate()
vii. collect()
viii. top()
ix. takeSample()
x. foreach()
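
A sketch of the main actions on the same nums RDD (1, 2, 3, 3); actions return a value to the driver (or write to storage) and trigger execution:

    nums.count()                   // 4
    nums.collect()                 // Array(1, 2, 3, 3)
    nums.take(2)                   // Array(1, 2)
    nums.top(2)                    // Array(3, 3)
    nums.reduce((a, b) => a + b)   // 9
    nums.fold(0)((a, b) => a + b)  // 9; like reduce but with a zero value

    // aggregate: compute (sum, count) in one pass, then the mean in the driver
    val (sum, cnt) = nums.aggregate((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),       // fold a value into an accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))       // merge accumulators across partitions
    println(sum.toDouble / cnt)                   // 2.25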
c) Persistence: persist() / cache() (sketch below)
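
A persistence sketch on the same data; cache() is shorthand for persist(MEMORY_ONLY):

    import org.apache.spark.storage.StorageLevel

    val squares = nums.map(x => x * x)
    squares.persist(StorageLevel.MEMORY_ONLY)  // other levels: MEMORY_AND_DISK, DISK_ONLY, ...
    squares.count()                            // first action materializes the cache
    squares.collect()                          // served from memory
    squares.unpersist()                        // release the cache when done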
Key-value pair operations (pair RDDs)
1. Creating pair RDDs
a) An ordinary RDD can be turned into a pair RDD with map() (see the sketch below)
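
A sketch of this conversion; the sample lines are made up, and the key is the first word of each line:

    val lines = sc.parallelize(List("holden likes coffee", "panda likes bamboo"))
    val pairs = lines.map(line => (line.split(" ")(0), line))
    // pairs: ("holden", "holden likes coffee"), ("panda", "panda likes bamboo")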
2. Transformations
a) Transformations on a single pair RDD (examples after this list)
i. reduceByKey(func)
ii. groupByKey()
iii. combineByKey()
iv. ...
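
A sketch of the single-pair-RDD transformations; combineByKey is spelled out to show its three functions, here building a per-key (sum, count):

    val kv = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

    kv.reduceByKey((x, y) => x + y)  // ("a", 4), ("b", 2)
    kv.groupByKey()                  // ("a", [1, 3]), ("b", [2])

    // combineByKey(createCombiner, mergeValue, mergeCombiners)
    kv.combineByKey(
      (v: Int) => (v, 1),                                     // first value seen for a key
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),  // another value, same partition
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // merge across partitions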
b) Transformations on two pair RDDs (examples after this list)
i. subtractByKey()
ii. join()
iii. ...
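
A sketch of the two-pair-RDD transformations, with leftOuterJoin and cogroup added as two more common members of this family:

    val rdd1 = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    val rdd2 = sc.parallelize(List((3, 9)))

    rdd1.subtractByKey(rdd2)  // (1, 2)                   -- drop keys present in rdd2
    rdd1.join(rdd2)           // (3, (4, 9)), (3, (6, 9)) -- inner join on key
    rdd1.leftOuterJoin(rdd2)  // keeps all keys from rdd1; missing values become None
    rdd1.cogroup(rdd2)        // group values from both RDDs by key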
3. Aggregation operations (e.g. a per-key average; sketch below)
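
A common aggregation sketch: the per-key average, built from mapValues plus reduceByKey; the data is made up for illustration:

    val scores = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1)))
    val avgByKey = scores
      .mapValues(v => (v, 1))                              // value -> (sum, count)
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // add sums and counts per key
      .mapValues { case (s, c) => s.toDouble / c }         // finish with the mean
    avgByKey.collect()  // ("panda", 0.5), ("pink", 3.0), ("pirate", 3.0)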

 
