【Spark】Spark High-Frequency Interview Questions in English (1)

  Today's update: Spark high-frequency interview questions in English, split into three parts: Freshers, Experienced 1, and Experienced 2.
The audio files can be found via the links below.
【Spark】Spark High-Frequency Interview Questions in English (1)
【Spark】Spark High-Frequency Interview Questions in English (2)
【Spark】Spark High-Frequency Interview Questions in English (3)

Apache Spark is an open-source, lightning-fast computation technology built on Hadoop and MapReduce that supports various computational techniques for fast and efficient processing. Spark is known for its in-memory cluster computing, which is the main feature behind the high processing speed of Spark applications. Spark was developed as a Hadoop subproject by Matei Zaharia in 2009 at UC Berkeley's AMPLab. It was open-sourced in 2010 under the BSD License and then donated to the Apache Software Foundation in 2013. From 2014 onwards, Spark has held top-level status among the projects undertaken by the Apache Foundation.

Spark Interview Questions for Freshers

1. Can you tell me what is Apache Spark about?

Apache Spark is an open-source framework engine known for its speed and ease of use in the field of big data processing and analysis. It also has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow; it can run either in cluster mode or in standalone mode, and it can access diverse data sources such as HBase, HDFS, Cassandra, etc.
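As a minimal illustration (a sketch only, assuming Spark is available locally; the application name is arbitrary), a Spark application boots a SparkSession and lets the engine run a small in-memory computation:

```scala
import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // Unified entry point to Spark SQL, DataFrames and the underlying SparkContext.
    val spark = SparkSession.builder()
      .appName("spark-hello")
      .master("local[*]")   // run locally with all cores; on a cluster the master is usually set by spark-submit
      .getOrCreate()

    // A tiny in-memory computation executed by the Spark engine.
    val data = spark.sparkContext.parallelize(1 to 1000)
    println(s"sum = ${data.sum()}")

    spark.stop()
  }
}
```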

2. What are the features of Apache Spark?

  • High Processing Speed: Apache Spark achieves very high data processing speed by reducing read and write operations to disk. It is almost 100x faster for in-memory computation and 10x faster for disk-based computation.
  • Dynamic Nature: Spark provides 80 high-level operators which help in the easy development of parallel applications.
  • In-Memory Computation: Thanks to its DAG execution engine, Spark's in-memory computation increases the speed of data processing. It also supports data caching, reducing the time required to fetch data from disk.
  • Reusability: Spark code can be reused for batch processing, data streaming, running ad-hoc queries, etc.
  • Fault Tolerance: Spark supports fault tolerance using RDDs. Spark RDDs are abstractions designed to handle failures of worker nodes, ensuring zero data loss.
  • Stream Processing: Spark supports stream processing in real time. The problem with the earlier MapReduce framework was that it could only process data that already existed.
  • Lazy Evaluation: Transformations on Spark RDDs are lazy, meaning they do not generate results right away; they only create new RDDs from existing ones. This lazy evaluation increases system efficiency (see the sketch after this list).
  • Support for Multiple Languages: Spark supports multiple languages such as R, Scala, Python, and Java, which adds flexibility and helps overcome the Hadoop limitation of developing applications only in Java.
  • Hadoop Integration: Spark also supports the Hadoop YARN cluster manager, thereby making it flexible.
  • Supports Spark GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, etc.
  • Cost Efficiency: Apache Spark is considered a more cost-efficient solution than Hadoop, as Hadoop requires large storage and data centers for data processing and replication.
  • Active Developer Community: Apache Spark has a large developer base involved in continuous development. It is considered one of the most important projects undertaken by the Apache community.
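To make the lazy evaluation point concrete, here is a minimal sketch (assuming a local Spark installation; names and values are illustrative) in which the transformations only build up a lineage and nothing is computed until the final action:

```scala
import org.apache.spark.sql.SparkSession

// Reuses an existing session (e.g. in spark-shell) or creates a local one.
val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)   // transformation: nothing executed yet
val bigOnes = squares.filter(_ > 1000L)     // another transformation: still nothing executed

// Only the action below triggers execution of the whole lineage.
println(s"count = ${bigOnes.count()}")

spark.stop()
```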

3. What is RDD?

RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of datasets:

  • Parallelized collections: meant to be run in parallel across the cluster.
  • Hadoop datasets: perform operations on records of files stored in HDFS or other storage systems.
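The two dataset types can be sketched as follows (the HDFS path is a hypothetical placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// 1. Parallelized collection: distribute an existing in-memory collection across partitions.
val parallelized = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)

// 2. Hadoop dataset: reference records of a file stored in HDFS or another Hadoop-supported storage system.
val hadoopDataset = sc.textFile("hdfs:///path/to/input.txt")   // hypothetical path

println(parallelized.count())
spark.stop()
```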

4. What does DAG refer to in Apache Spark?

DAG stands for Directed Acyclic Graph, a graph with a finite number of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. The vertices represent Spark RDDs and the edges represent the operations to be performed on those RDDs.
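The DAG Spark builds for a job can be inspected from the RDD lineage; a small sketch (local mode, illustrative values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val result = sc.parallelize(1 to 100)   // vertex: the source RDD
  .map(_ * 2)                           // edge: map -> new RDD vertex
  .filter(_ % 3 == 0)                   // edge: filter -> new RDD vertex
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)                   // edge that introduces a shuffle (stage boundary)

// Prints the chain of RDDs and their dependencies, i.e. the DAG Spark will execute.
println(result.toDebugString)

spark.stop()
```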

5. List the types of Deploy Modes in Spark.

There are 2 deploy modes in Spark. They are:

  • Client Mode: The deploy mode is said to be client mode when the Spark driver component runs on the machine node from which the Spark job is submitted.
    • The main disadvantage of this mode is that if the machine node fails, the entire job fails.
    • This mode supports both interactive shells and the job-submission commands.
    • The performance of this mode is the worst, so it is not preferred in production environments.
  • Cluster Mode: If the Spark driver component does not run on the machine from which the Spark job was submitted, the deploy mode is said to be cluster mode.
    • The Spark job launches the driver component within the cluster as part of a sub-process of the ApplicationMaster.
    • This mode supports deployment only using the spark-submit command (interactive shell mode is not supported).
    • Here, since the driver program runs in the ApplicationMaster, the driver program is re-instantiated in case it fails.
    • In this mode, there is a dedicated cluster manager (such as standalone, YARN, Apache Mesos, Kubernetes, etc.) for allocating the resources required for the job to run, as shown in the architecture below.

[Figure: Spark cluster deploy mode architecture]

Apart from the above two modes, if we have to run the application on our local machine for unit testing and development, the deployment mode is called "Local Mode". Here, the jobs run in a single JVM on a single machine, which makes it highly inefficient, as at some point there will be a shortage of resources, resulting in failed jobs. It is also not possible to scale up resources in this mode due to the restricted memory and space.
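A minimal local-mode sketch (for unit tests and development only; the thread count is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-mode-test")
  .master("local[2]")   // single JVM on this machine, using 2 threads as workers
  .getOrCreate()

// Small job that runs entirely inside this one JVM.
println(spark.range(0, 100).count())

spark.stop()
```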

6. What are receivers in Apache Spark Streaming?

Receivers are the entities that consume data from different data sources and then move it to Spark for processing. They are created using the streaming context in the form of long-running tasks that are scheduled to operate in a round-robin fashion, with each receiver configured to use only a single core. The receivers run on various executors to accomplish the task of data streaming. There are two types of receivers, depending on how the data is sent to Spark:

  • Reliable receivers: the receiver sends an acknowledgement to the data source after the data has been successfully received and replicated in Spark storage.
  • Unreliable receivers: no acknowledgement is sent to the data source.
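As an illustration, the built-in socket source starts a receiver-based input stream; the host and port below are placeholders, and the sketch assumes the spark-streaming dependency is on the classpath:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least 2 local cores: one is occupied by the long-running receiver task, one does the processing.
val conf = new SparkConf().setAppName("receiver-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// socketTextStream creates a receiver that runs on an executor and consumes data from the socket.
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder host/port
lines.count().print()

ssc.start()
ssc.awaitTermination()
```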

7. What is the difference between repartition and coalesce?

|  | Repartition | Coalesce |
| --- | --- | --- |
| Usage | repartition can increase/decrease the number of data partitions. | coalesce can only reduce the number of data partitions. |
| Shuffle | Repartition creates new data partitions and performs a full shuffle of evenly distributed data. | Coalesce makes use of already existing partitions, so less data is shuffled, but the partitions may end up unevenly sized. |
| Speed | Repartition internally calls coalesce with the shuffle parameter, thereby making it slower than coalesce. | Coalesce is faster than repartition. However, if there are unequal-sized data partitions, the speed might be slightly slower. |
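A quick sketch of the difference on partition counts (partition numbers are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)                   // 8

println(rdd.repartition(16).getNumPartitions)   // 16: repartition can increase (full shuffle)
println(rdd.coalesce(4).getNumPartitions)       // 4: coalesce only merges existing partitions
println(rdd.coalesce(16).getNumPartitions)      // stays 8: coalesce cannot increase without shuffle = true

spark.stop()
```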

8. What are the data formats supported by Spark?

Spark supports both raw files and structured file formats for efficient reading and processing. File formats like Parquet, JSON, XML, CSV, RC, Avro, TSV, etc. are supported by Spark.
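For example, several of these formats can be read through the DataFrame reader (the file paths below are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats-demo").master("local[*]").getOrCreate()

// Each reader returns a DataFrame; the paths are placeholders.
val csvDf     = spark.read.option("header", "true").csv("data/people.csv")
val jsonDf    = spark.read.json("data/people.json")
val parquetDf = spark.read.parquet("data/people.parquet")

csvDf.printSchema()
spark.stop()
```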

9. What do you understand by Shuffling in Spark?

The process of redistributing data across different partitions, which may or may not cause data movement across JVM processes or across executors on separate machines, is known as shuffling/repartitioning. A partition is simply a smaller logical division of the data.

It is to be noted that Spark has no control over what partition the data gets distributed across.
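A typical shuffle is triggered by a key-based operation such as reduceByKey, since records with the same key must be moved to the same partition; a small sketch (illustrative data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // shuffle: same keys are gathered into the same partition

counts.collect().foreach(println)
println(counts.toDebugString)   // the ShuffledRDD marks the stage boundary introduced by the shuffle

spark.stop()
```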

10. What is YARN in Spark?

  • YARN is one of the key cluster managers supported by Spark; it provides a central resource management platform for delivering scalable operations across the cluster.
  • YARN is a cluster management technology, while Spark is a tool for data processing.
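As a rough sketch, an application can be pointed at YARN as its cluster manager; in practice the master and resources are usually passed via spark-submit rather than hard-coded, and the configuration values below are illustrative (the sketch assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("yarn-demo")
  .master("yarn")                            // YARN allocates the executors for this application
  .config("spark.executor.instances", "4")   // illustrative resource requests handed to YARN
  .config("spark.executor.memory", "2g")
  .getOrCreate()

println(spark.range(0, 1000).count())
spark.stop()
```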

Reposted from blog.csdn.net/weixin_45545090/article/details/125601425