[Interview] [Spark] Advanced Big Data (1)

0. Question outline

1. Spark basic concepts
1.1 Spark overview

1. Differences between Spark 1.X and Spark 2.X.
 - Follow-up 1: In terms of runtime efficiency, Spark outperforms MapReduce. Which of Spark's built-in mechanisms account for the higher efficiency?

2. Spark vs. Flink.

3. Differences between Spark Streaming and Flink for real-time processing.

1.2 Spark RDD

1. Spark RDD: introduction, properties, and principles.
 - Follow-up 1: How many ways are there to create an RDD?
 - Follow-up 2: Where does the data in an RDD live?
 - Follow-up 3: After caching an RDD, where is the data?
 - Follow-up 4: What is the difference between cache and checkpoint?

2. What are the differences among RDD, DataFrame, and DataSet?
 - Follow-up 1: What are the ways to create a DataSet?
 - Follow-up 2: How are the three related and converted?
 - Follow-up 3: What are the pros and cons of the three?

3. What types of operations/operators does RDD support?
 - Follow-up 1: What are the Transformation operators, and what do they do?
 - Follow-up 2: What are the Action operators, and what do they do?
 - Follow-up 3: The difference between map and flatMap (×2) and their application scenarios?
 - Follow-up 4: The difference between sortBy and sortByKey?
 - Follow-up 5: The difference between groupBy and groupByKey?
 - Follow-up 6: The difference between groupByKey and reduceByKey?
 - Follow-up 7: The difference between cartesian and join?
 - Follow-up 8: What does the aggregation operator aggregate do? ……

4. The difference between RDD and Partition in Spark
 - Follow-up 1: Block vs. Partition?

1.1 Spark overview

1. The difference between Spark 1.X and Spark 2.X.

  • 1) Performance: optimized Spark's memory and CPU usage, approaching the limits of the physical machine.
  • 2) Functionality: Spark SQL unified the DataFrame and DataSet APIs and added SQL:2003 support; Spark Streaming introduced Structured Streaming; Spark MLlib 2.0 was released.
Follow-up 1: In terms of runtime efficiency, Spark outperforms MapReduce. Which of Spark's built-in mechanisms account for the higher efficiency?

……

2. Spark and Flink.

Spark provides high-speed batch processing and handles streams in micro-batch mode; Flink's batch processing is largely an extension (a special case) of its stream processing.

3. The difference between Spark Streaming and Flink for real-time processing.

……

1.2 Spark RDD

1. Introduction, properties and principles of Spark RDD.

  1. Introduction: RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction.

  2. Properties:

  • A list of partitions: the basic building blocks of the data set;
  • A compute function (compute(p, context)): each RDD has a function that computes each of its partitions, implementing the conversion between parent and child partitions;
  • Dependencies (dependencies()): an RDD records its dependencies on parent RDDs;
  • A partitioner (partitioner()): how the data of a key-value RDD is partitioned;
  • Locality (preferredLocations()): the preferred locations for computing each partition (move the computation to where the data is stored);
  3. Example: a WordCount job forms a lineage of RDDs (figure omitted).
Follow-up 1: How many ways are there to create an RDD?
  1) Parallelize a local collection:
val arr = Array(1,2,3,4,5)
val rdd = sc.parallelize(arr)
val rdd =sc.makeRDD(arr)
  2) Read from an external file system such as HDFS, or from a local file (the most common way):
val rdd2 = sc.textFile("hdfs://hdp-01:9000/words.txt")
// read a local file
val rdd2 = sc.textFile("file:///root/words.txt")
  3) Transform a parent RDD into a new child RDD by calling a Transformation method, for example:
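A minimal sketch of the third way, assuming `sc` is an existing SparkContext:

val parent = sc.parallelize(Array(1, 2, 3, 4, 5))
// map is a Transformation: it derives a new child RDD from the parent
val child = parent.map(_ * 2)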
Follow-up 2: Where is the data in the RDD?

In the data source. An RDD is just an abstract description of a data set; operating on the RDD is equivalent to operating on the underlying data.

Follow-up 3: If the RDD is cached, where is the data?

The cache operator loads the data into the memory of each Executor process the first time an action is executed; from then on it is read directly from memory.

Follow-up 4: The difference between cache and checkPoint?
  • Comparison: cache keeps the computed RDD in memory, but the dependency chain (lineage) cannot be dropped: if a node goes down, the cached partitions on it are lost and must be recomputed. checkpoint saves the RDD to HDFS, a reliable multi-replica store, so the dependency chain can be discarded.
  • In practice: when persisting an RDD, cache it first, because checkpoint would otherwise recompute the job from the beginning; with the RDD cached, checkpoint can save it directly without recomputation, which greatly improves performance.
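A minimal sketch of the cache-then-checkpoint pattern, assuming `sc` is an existing SparkContext and the HDFS paths are illustrative:

sc.setCheckpointDir("hdfs://hdp-01:9000/checkpoints") // reliable storage for checkpoints
val rdd = sc.textFile("hdfs://hdp-01:9000/words.txt").flatMap(_.split(" "))
rdd.cache()      // keep computed partitions in Executor memory
rdd.checkpoint() // mark the RDD; it is written to HDFS when an action runs
rdd.count()      // action: fills the cache and writes the checkpoint without recomputing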

2. What is the difference between RDD, DataFrame and DataSet?

  • 1) RDD: A collection of immutable distributed elements, distributed across nodes in the cluster.

  • 2) DataFrame: A distributed data set based on RDD, similar to a two-dimensional table.

  • 3) DataSet: a strongly typed collection of domain-specific objects. Every DataSet has an untyped view called a DataFrame, which is a DataSet of Row.

  • 4) RDD VS DataFrame

Same: both are immutable distributed data sets.
Different: a DataFrame stores its data in named columns, i.e., structured data, similar to a table.
  • 5) RDD VS DataSet

Different:
- A DataSet is a collection of domain-specific objects, while an RDD is a collection of arbitrary objects.
- A DataSet is strongly typed, which enables optimizations that are not possible for an RDD.
  • 6) DataFrame VS DataSet

Different:
- A DataFrame is weakly typed: types are checked only at execution time. A DataSet is strongly typed: types are checked at compile time.
- A DataFrame uses Java/Kryo serialization; a DataSet is serialized through Encoders, which support dynamic code generation and allow operations such as sorting directly at the byte level.
Follow-up 1: What are the ways to create a DataSet?

……
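As a hedged illustration (the names and path below are examples, assuming a SparkSession `spark` with its implicits imported), two common ways are from a local collection and from a DataFrame:

import spark.implicits._
case class Person(name: String, age: Long)
// 1) from a local collection of typed objects
val ds1 = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
// 2) from a DataFrame, by supplying the element type
val ds2 = spark.read.json("people.json").as[Person] // illustrative path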

Follow-up 2: How are the three related, and how do they convert to one another?

Relationship: DataFrame and DataSet share a unified API since Spark 2.X. RDD is the lower-level, more fundamental abstraction; DataFrame/DataSet are optimized encapsulations on top of it and are more convenient to use. Functionally, RDD is the most general: everything a DataFrame/DataSet can do can also be done with an RDD, but not the other way around.

Conversion:
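A minimal sketch of the conversions, assuming a SparkSession `spark`, its implicits imported, and the `Person` case class from above:

import spark.implicits._
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))
val df = rdd.toDF()     // RDD -> DataFrame
val ds = rdd.toDS()     // RDD -> DataSet
val ds2 = df.as[Person] // DataFrame -> DataSet
val backToRdd = ds.rdd  // DataSet (or DataFrame) -> RDD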

Follow-up 3: Advantages and disadvantages of the three?
RDD
- Advantages: 1) Powerful: many built-in functional operations make it convenient to process structured or unstructured data. 2) Object-oriented programming with type-safe conversions.
- Disadvantages: 1) Processing structured data (SQL-style) is cumbersome. 2) Java serialization is expensive. 3) Data is stored on the JVM heap, so GC is frequent.

DataFrame
- Advantages: 1) Structured data processing is convenient. 2) Targeted optimizations: serialization does not need to carry type metadata, and data is stored off-heap, reducing GC pressure. 3) Hive-compatible: supports HQL, UDFs, etc.
- Disadvantages: 1) No type-safety checks at compile time. 2) Weak object support: rows are stored as Row objects in memory, and custom objects are not supported.

DataSet
- Advantages: 1) Supports structured and unstructured data. 2) Supports querying custom objects. 3) Off-heap storage, GC-friendly. 4) Type-safe conversions with code generation.
- Disadvantages: -
  • The official recommendation is to use DataSet.

3. What types of operations (operator categories) does RDD support?

Answer: [mainly from reference 9]

  • Transformation: lazily evaluated; it does not execute immediately.
  • Action: the job actually runs when the program encounters this kind of operator.
    Note: laziness (lazy evaluation) chains the computation together and avoids materializing wasteful intermediate results.
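A small sketch of laziness, assuming an existing SparkContext `sc`:

val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)   // Transformation: nothing runs yet
val big = doubled.filter(_ > 4) // still nothing runs
val result = big.collect()      // Action: the whole chain executes now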
Follow-up 1: What are the operators of Transformation and what are their functions?
  • map: one-to-one mapping from each element of the original RDD to an element of the new RDD.
  • flatMap: each element of the original RDD is transformed by a function into multiple new elements, which are flattened into a new RDD.
  • filter: keeps only the elements that satisfy a predicate.
  • union: combines two RDDs of the same type without removing duplicates.
  • distinct: removes duplicates.
  • cartesian: returns the Cartesian product of two RDDs.
  • groupBy: derives a key for each element, converts the data into key-value form, and then groups elements with the same key together.
  • groupByKey: groups the data of a pair RDD; called on a data set of (K, V) pairs, it returns a data set of (K, Seq[V]) pairs.
  • sortBy: sorts the elements of an RDD according to a given criterion.
  • sortByKey: sorts the elements of a pair RDD by key.
  • reduceByKey: aggregates the values of a pair RDD by key.
  • join: joins two pair RDDs on their keys; for elements with the same key, it takes the Cartesian product of their values and returns pairs of type (key, (value1, value2)).
Follow-up 2: What are the operators of Action and what are their functions?
  • foreach: applies a custom operation to each element of the RDD, e.g., printing one by one.
  • saveAsTextFile: saves the data to a specified HDFS directory.
  • collect: returns the distributed RDD as a local Scala Array.
  • count: returns the number of elements in the RDD.
Follow-up 3: What is the difference (x2) and application scenarios between map and flatMap?

Answer: map applies a function to each element one-to-one; flatMap applies the function and then flattens the results. Use map when each input yields exactly one output, and flatMap when each input can yield zero or more outputs (e.g., splitting lines into words).
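A small sketch of the contrast, assuming an existing SparkContext `sc`:

val lines = sc.parallelize(Seq("hello world", "hi"))
lines.map(_.split(" ")).collect()     // Array(Array(hello, world), Array(hi))
lines.flatMap(_.split(" ")).collect() // Array(hello, world, hi)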

Follow-up 4: The difference between sortBy and sortByKey
sortBy can be applied both to an ordinary RDD[T] and to a pair RDD[(K, V)];
sortByKey can only be applied to a pair RDD[(K, V)].
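A minimal sketch, assuming an existing SparkContext `sc`:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
pairs.sortByKey().collect()                     // sort by key: (a,1), (b,2), (c,3)
pairs.sortBy(_._2, ascending = false).collect() // sort by any criterion: (c,3), (b,2), (a,1)
sc.parallelize(Seq(3, 1, 2)).sortBy(identity).collect() // sortBy also works on a plain RDD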


Follow-up 5: The difference between groupBy and groupByKey?
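
groupBy works on any RDD and takes a function that derives the grouping key from each element; groupByKey works only on a pair RDD and groups by the existing key. A minimal sketch, assuming an existing SparkContext `sc`:

val words = sc.parallelize(Seq("apple", "bat", "ant"))
words.groupBy(_.head).collect()  // group by a derived key: (a, [apple, ant]), (b, [bat])
val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
kv.groupByKey().collect()        // group by the existing key: (a, [1, 3]), (b, [2])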
Follow-up 6: What is the difference between groupBykey and reduceByKey?

reduceByKey aggregates within each partition first (a map-side combine), so it is more efficient and is generally preferred for large-scale computation; if you only need the grouping itself, use groupByKey.
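
A minimal sketch, assuming an existing SparkContext `sc`:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// reduceByKey combines values inside each partition before the shuffle
pairs.reduceByKey(_ + _).collect()            // (a, 4), (b, 2)
// groupByKey ships every value across the shuffle, then groups
pairs.groupByKey().mapValues(_.sum).collect() // same result, more shuffle traffic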


Follow-up 7: What is the difference between cartesian and join?

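cartesian pairs every element of one RDD with every element of the other, ignoring keys, and produces m × n pairs; join works on pair RDDs and only combines elements that share the same key. A minimal sketch, assuming an existing SparkContext `sc`:

val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val b = sc.parallelize(Seq(("k1", "x"), ("k3", "y")))
a.cartesian(b).collect() // all 4 pairings, keys ignored
a.join(b).collect()      // only matching keys: (k1, (1, x))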

Follow-up 8: What is the role of aggregation operator aggregate?

……
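As a fact about the standard API: aggregate(zeroValue)(seqOp, combOp) first folds the elements of each partition with seqOp, starting from zeroValue, and then merges the per-partition results with combOp; unlike reduce, the result type may differ from the element type. A minimal sketch, assuming an existing SparkContext `sc`:

val nums = sc.parallelize(1 to 4)
// compute (sum, count) in a single pass
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp: fold within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: merge partition results
)
// sum = 10, count = 4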

4. The difference between RDD and Partition in Spark

Answer: a Partition is the smallest unit of data in Spark, distributed across the nodes; multiple Partitions together make up an RDD. For the same data set (RDD), the size and number of partitions are not fixed: they depend on the operators used in the application and on the initial number of partitions. This elasticity is one reason it is called a resilient distributed data set.
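A small sketch of inspecting and changing partitioning, assuming an existing SparkContext `sc`:

val rdd = sc.parallelize(1 to 100, 4) // request 4 partitions
println(rdd.getNumPartitions)         // 4
val wider = rdd.repartition(8)        // shuffle into 8 partitions
val narrower = rdd.coalesce(2)        // shrink to 2, avoiding a full shuffle
println(narrower.getNumPartitions)    // 2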

Follow-up 1: Block VS Partition?

Answer: a block in HDFS is the smallest storage unit; its size is fixed and it is stored redundantly (replicated). A partition is the smallest compute unit of an RDD; its size is not fixed, it is not replicated, and it can be recomputed from lineage if lost.

4. References

1. A summary of real questions from 2020 big data interviews (with answers)
2. Spark vs. Flink: in the next-generation big data computing engine contest, who will prevail?
3. The relationship between Spark 2.x and Spark 1.x
4. What improvements did Spark 2.x make over Spark 1.x
5. The road to learning Spark (3): Spark RDD
6. Spark's core concept: RDD
7. The differences among RDD, DataFrame, and DataSet
8. An explanation of RDD, DataFrame, and DataSet in Spark
9. [Spark series 2] The difference and usage of reduceByKey and groupByKey
10. How to understand the relationship between partition and block in Spark?


Origin blog.csdn.net/HeavenDan/article/details/112431164