[Hadoop] Frequently Asked Hadoop Interview Questions, English Edition (1)

  Starting today, we are publishing frequently asked Hadoop interview questions in English, in four parts: Freshers 1, Freshers 2, Experienced 1, and Experienced 2.
Click below for the audio files.
[Hadoop] Frequently Asked Hadoop Interview Questions, English Edition (1)
[Hadoop] Frequently Asked Hadoop Interview Questions, English Edition (2)
[Hadoop] Frequently Asked Hadoop Interview Questions, English Edition (3)

Apache Hadoop is an open-source software library used to manage data processing and storage in big data applications. Hadoop makes it possible to analyze vast amounts of data in parallel and more quickly. Apache Hadoop was introduced to the public in 2012 by The Apache Software Foundation (ASF). Hadoop is economical to use because data is stored on affordable commodity servers that run as clusters.

Before the digital era, data was gathered slowly and could be examined and stored in a single storage format. Data collected for similar purposes also shared the same format. However, with the development of the Internet and digital platforms like social media, data now arrives in multiple formats (structured, semi-structured, and unstructured), and its velocity has also grown massively. This data was given a new name: big data. The need then arose for multiple processors and storage units to handle it. Hadoop was introduced as the solution.

Hadoop Interview Questions for Freshers

1. Explain big data and list its characteristics.

Gartner defined big data as:
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Simply put, big data refers to larger, more complex data sets, particularly from new data sources. These data sets are so large that conventional data processing software can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

1. 解释大数据并列出其特征。

Gartner 将大数据定义 为——
大数据”是海量、速度和种类繁多的信息资产,需要具有成本效益、创新的信息处理形式,以增强洞察力和决策能力。

简单地说,大数据是更大、更复杂的数据集,尤其是来自新数据源的数据。这些数据集是如此之大,以至于传统的数据处理软件无法管理它们。但这些海量数据可用于解决您以前无法解决的业务问题。

Characteristics of Big Data are:

  • Volume: Volume refers to the large amount of data stored in data warehouses.
  • Velocity: Velocity typically refers to the pace at which data is being generated in real time.
  • Variety: Variety refers to the structured, unstructured, and semi-structured data that is collected from multiple sources.
  • Veracity: Data veracity generally refers to how accurate the data is.
  • Value: No matter how fast or in what quantity data is produced, it has to be reliable and valuable. Otherwise, the information is not good enough for processing or analysis.

2. Explain Hadoop. List the core components of Hadoop.

The three core components of Hadoop are:

  1. Hadoop YARN - It is a resource management unit of Hadoop.
  2. Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
  3. Hadoop MapReduce - It is the processing unit of Hadoop.

3. Explain the Storage Unit In Hadoop (HDFS).

HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop. Files in HDFS are split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the cluster. By default, the size of a block is 128 MB, which can be configured to suit our needs. HDFS follows a master-slave architecture and contains two daemons: the NameNode and the DataNodes.

NameNode
The NameNode is the master daemon that runs on the master node. It stores the filesystem metadata: file names, data about the blocks of each file, block locations, permissions, and so on. It manages the DataNodes.
DataNode
The DataNodes are slave daemons that run on the slave nodes. They store the actual business data and serve client read/write requests based on the NameNode's instructions. The DataNodes hold the blocks of the files, while the NameNode holds metadata such as block locations and permissions.
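
As a concrete illustration, below is a minimal hdfs-site.xml sketch that overrides the default block size and replication factor. The property names are the standard Hadoop 2.x+ ones; the values are illustrative examples, not recommendations:

```xml
<!-- hdfs-site.xml: illustrative overrides, assuming Hadoop 2.x+ property names -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <!-- 256 MB instead of the 128 MB default, expressed in bytes -->
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- number of copies kept of each block (the default is 3) -->
    <value>3</value>
  </property>
</configuration>
```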

4. Mention the different features of HDFS.

  • Fault Tolerance
    Hadoop framework divides data into blocks and creates various copies of blocks on several machines in the cluster. So, when any device in the cluster fails, clients can still access their data from the other machine containing the exact copy of data blocks.
  • High Availability
    In the HDFS environment, data is duplicated by creating copies of the blocks. So, whenever users want to obtain this data, or in case of an unfortunate situation, they can simply access it from other nodes, because copies of the blocks are already present on the other nodes of the HDFS cluster.
  • High Reliability
    HDFS splits the data into blocks, and the Hadoop framework stores these blocks on the nodes in the cluster. It protects data by creating a replica of every block present in the cluster, thereby providing fault tolerance. By default, it creates 3 replicas of each block across the nodes, so the data is promptly available to users and users do not face the problem of data loss. Hence, HDFS is very reliable.
  • Replication
    Replication resolves the problem of data loss in adverse conditions like device failure, crashing of nodes, etc. It manages the replication process at frequent intervals of time, so there is a low probability of losing user data (see the shell sketch after this list).
  • Scalability
    HDFS stores data on multiple nodes, so the cluster can be scaled out when demand increases.
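
For example, replication can be inspected and adjusted per file with the standard HDFS shell. A minimal sketch, assuming a hypothetical file /data/example.txt already stored in HDFS:

```bash
# Inspect the block layout and replication health of a (hypothetical) file
hdfs fsck /data/example.txt -files -blocks

# Raise its replication factor to 5 and wait until the extra copies exist
hdfs dfs -setrep -w 5 /data/example.txt
```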

5. What are the limitations of Hadoop 1.0?

  • Only one NameNode can be configured.
  • The Secondary NameNode only takes hourly backups of metadata from the NameNode.
  • It is only suitable for batch processing of the vast amounts of data already in the Hadoop system.
  • It is not ideal for real-time data processing.
  • It supports up to 4000 nodes per cluster.
  • It has a single component, the JobTracker, to perform many activities: resource management, job scheduling, job monitoring, rescheduling jobs, etc.
  • The JobTracker is a single point of failure.
  • It supports only one NameNode and one namespace per cluster.
  • It does not support horizontal scalability of the NameNode.
  • It runs only Map/Reduce jobs.

6. Compare the main differences between HDFS (Hadoop Distributed File System) and Network Attached Storage (NAS).

  • HDFS is a distributed file system that is mainly used to store data on commodity hardware, whereas NAS is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.
  • HDFS is designed to work with the MapReduce paradigm; NAS is not suitable for working with MapReduce.
  • HDFS is cost-effective, while NAS is a high-end storage device that is highly expensive.

7. List Hadoop Configuration files.

  • hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop.
  • core-site.xml: Configuration settings for Hadoop Core, such as the I/O settings that are common to HDFS and MapReduce.
  • hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
  • mapred-site.xml: Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
  • masters: A list of machines (one per line) that each run a secondary namenode.
  • slaves: A list of machines (one per line) that each run a datanode and a tasktracker.
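
As a companion to the list, here is a minimal core-site.xml sketch pointing Hadoop at a single-node HDFS instance. The localhost:9000 address is an assumption about a local setup, and fs.defaultFS is the Hadoop 2.x+ name of the property (1.x releases used fs.default.name):

```xml
<!-- core-site.xml: assumes a single-node cluster reachable at localhost:9000 -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```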

8. Explain Hadoop MapReduce.

Hadoop MapReduce is a software framework for processing enormous data sets. It is the main data-processing component of the Hadoop framework. It divides the input data into several parts and runs a program on each part in parallel. The name MapReduce refers to two separate and distinct tasks: map and reduce.


The first is the map operation, which takes a set of data and transforms it into a different collection of data in which individual elements are broken down into tuples (key/value pairs). The reduce operation then consolidates those tuples based on the key and aggregates their values.

Let us take an example of a text file called example_data.txt and understand how MapReduce works.

The content of the example_data.txt file is:
coding,jamming,ice,river,man,driving

Now, assume we have to compute the word count for example_data.txt using MapReduce. We will be looking for the unique words and the number of times each unique word appears.

[Figure: the word-count example flowing through the split, map, shuffle/sort, and reduce stages]

  • First, we break the input into three divisions, as seen in the figure. This shares the work among all the map nodes.
  • Then, all the words are tokenized in each of the mappers, and each token is given a hardcoded value of 1. The reason for giving a hardcoded value equal to 1 is that every word, by itself, occurs at least once.
  • Now, a list of key-value pairs is created, where the key is an individual word and the value is one. So, for the first line (Coding Ice Jamming), we have three key-value pairs: Coding, 1; Ice, 1; Jamming, 1.
  • The mapping process is the same on all the nodes.
  • Next, a partitioning process takes place, followed by sorting and shuffling, so that all tuples with the same key are sent to the same reducer.
  • After the sorting and shuffling phase, every reducer has a unique key and a list of values matching that key. For example, Coding, [1,1]; Ice, [1,1,1]; etc.
  • Now, each reducer adds up the values present in its list. As shown in the example, the reducer gets the list of values [1,1] for the key Jamming. It sums the ones in that list and gives the final output: Jamming, 2.
  • Lastly, all the output key/value pairs are collected and written to the output file. (A runnable sketch of this job follows the list.)
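
To make the walkthrough concrete, here is a minimal sketch of this word-count job written against the classic Hadoop MapReduce Java API. The class names are our own, and the tokenizer splits on commas as well as whitespace so that example_data.txt tokenizes as shown; treat this as an illustrative sketch rather than a reference implementation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split on commas and whitespace so "coding,jamming,ice,..." tokenizes.
      StringTokenizer itr = new StringTokenizer(value.toString(), ", \t\n");
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // e.g. ("Jamming", 1)
      }
    }
  }

  // Reduce phase: sum the list of 1s that arrives for each unique key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                  // ("Jamming", [1,1]) -> 2
      }
      result.set(sum);
      context.write(key, result);          // final output: ("Jamming", 2)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. example_data.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once compiled into a jar, it could be run with something like hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are hypothetical).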

9. What is shuffling in MapReduce?

In Hadoop MapReduce, shuffling is the process of transferring data from the mappers to the reducers. The system sorts the map output and transfers it as input to the reducers. It is a significant process for the reducers; without it, they would not receive any input. Moreover, since shuffling can begin even before the map phase has completed, it helps to save time and finish the job sooner.
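
Because shuffling moves map output across the network, Hadoop also lets a job register a combiner that pre-aggregates each mapper's local output before it is shuffled. A one-line sketch, reusing the IntSumReducer class and the job object from the word-count example in question 8:

```java
// Pre-sum duplicate keys on each mapper before the shuffle, so that
// ("Ice", [1,1]) leaves the map node as ("Ice", 2) and less data crosses
// the network. This is safe for word count because addition is associative
// and commutative; "job" is the Job object from the earlier sketch.
job.setCombinerClass(IntSumReducer.class);
```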

10. List the components of Apache Spark.

Apache Spark comprises the Spark Core engine, Spark Streaming, MLlib, GraphX, Spark SQL, and SparkR.

The Spark Core engine can be used along with any of the other five components. It is not necessary to use all of the Spark components together; depending on the use case, one or more of them can be used along with Spark Core.

11. What are the three modes in which Hadoop can run?

  • Local Mode or Standalone Mode
    By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Instead of HDFS, this mode uses the local file system. It is helpful for debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters, or slaves. Standalone mode is ordinarily the fastest mode in Hadoop.
  • Pseudo-distributed Mode
    In this mode, each daemon runs in a separate Java process. This mode requires custom configuration (core-site.xml, hdfs-site.xml, mapred-site.xml), and HDFS is used for input and output (see the sketch after this list). This mode of deployment is useful for testing and debugging purposes.
  • Fully Distributed Mode
    This is the production mode of Hadoop. One machine in the cluster is designated exclusively as the NameNode and another as the Resource Manager; these are the masters. The remaining nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and environment variables need to be defined for the Hadoop daemons. This mode provides fully distributed computing capacity, security, fault tolerance, and scalability.
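
As promised in the list above, here is a quick-start sketch of the configuration and commands a pseudo-distributed setup needs, based on the standard single-node walkthrough. The relative paths and the localhost:9000 address are assumptions about a default installation:

```bash
# Assumes core-site.xml sets fs.defaultFS=hdfs://localhost:9000 and
# hdfs-site.xml sets dfs.replication=1, as in the official single-node guide.

bin/hdfs namenode -format                # initialize the NameNode metadata
sbin/start-dfs.sh                        # start NameNode, SecondaryNameNode, DataNode
bin/hdfs dfs -mkdir -p /user/$(whoami)   # create a working directory in HDFS
bin/hdfs dfs -put example_data.txt /user/$(whoami)/   # stage input into HDFS
```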

Reposted from blog.csdn.net/weixin_45545090/article/details/125601304