[Interview] Basics of Big Data (1)

0. Question outline

1. Basic concepts of big data

1. Characteristics of big data.
2. Differences and connections (in implementation) between big data stream processing technologies (*2); what about batch processing technology?
3. The CAP theorem for distributed systems, with emphasis on the meaning of partition tolerance.
4. Differences between Hadoop 1.x and 2.x.
5. Introduce MapReduce (*3).
 - Follow-up 1: What is the combine step in the middle of MapReduce for? What benefits does it bring, and are there restrictions on its use?
 - Follow-up 2: Walk through the process of joining two tables with MapReduce.
 - Follow-up 3: Write code: implement top 10 with MR.
6. In HA HDFS, what role does ZooKeeper play, and why is ZooKeeper needed? (*2)

1. Basic concepts of big data

1. The characteristics of big data

Answer: The 4 Vs: Volume, Variety, Value, and Velocity, meaning large volume, high diversity, low value density, and high speed.

2. The (implementation) differences and connections between big data stream processing technologies (*2); what about batch processing technology?
Batch only: Hadoop
Stream only: Storm, Samza
Hybrid (batch + stream): Spark, Flink

Batch processing operates on large, static data sets and returns the result once processing is complete. Its data sets are bounded, persistent, and large.

Stream processing computes over real-time data: instead of operating on the whole data set, it operates on each data item as it arrives. Its data set is "unbounded".

Spark provides high-speed batch processing and a micro-batch mode of stream processing. Flink's batch processing is largely an extension of its stream processing: instead of reading from a continuous stream, it reads a bounded data set from persistent storage.
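The bounded/unbounded distinction above can be made concrete with a small sketch. This is an illustrative Python analogy, not any framework's API: a batch job consumes a finite list and returns once, while a stream job consumes items one at a time from a potentially endless source and emits updated results as it goes.

```python
from itertools import islice

def batch_sum(dataset):
    # Batch: the whole bounded data set is available before processing starts,
    # and a single final result is returned when processing finishes.
    return sum(dataset)

def stream_sums(source):
    # Stream: operate on each item as it arrives; the source may be unbounded,
    # so an updated result is emitted per item instead of one final answer.
    total = 0
    for item in source:
        total += item
        yield total

def sensor():
    # Stand-in for an unbounded real-time source (never terminates on its own).
    n = 0
    while True:
        n += 1
        yield n

print(batch_sum([1, 2, 3, 4]))                  # 10
print(list(islice(stream_sums(sensor()), 4)))   # [1, 3, 6, 10]
```

Note that the stream version never "finishes" by itself; the consumer decides how much of the unbounded result to take.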

3. The CAP theory of distributed systems focuses on explaining the meaning of partition fault tolerance.
  1. Content: During read and write operations in a distributed system, only two of Consistency, Availability, and Partition Tolerance can be guaranteed; the third must be sacrificed.

  2. Possible combinations

  • CP (Consistency + Partition Tolerance)

The link between Node1 and Node2 is interrupted, creating a partition. Node1's data has been updated to y, but because the Node1-Node2 replication channel is down, y cannot be synchronized to Node2, which still holds the old data.
When client C now accesses Node2, Node2 must return an error, telling the client that "the system currently has an error". This behavior violates the A (Availability) requirement.

  • AP (Availability + Partition Tolerance)

Node2's data is still the old x. When the client accesses Node2, Node2 returns its current data x to the client, even though the latest data is already y; this violates the C (Consistency) requirement. In this case the system satisfies only AP.
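The two scenarios above differ only in how the stale replica answers a read during the partition. As a minimal sketch (the `Node` class and `read` function are illustrative, not from any real system): a CP system refuses to answer rather than risk returning stale data, while an AP system stays available and returns whatever it currently holds.

```python
class Node:
    def __init__(self, data):
        self.data = data

def read(node, partitioned, mode):
    """Simulate a replica's read behavior during a network partition.

    mode="CP": refuse to answer rather than return possibly-stale data
               (sacrifices Availability).
    mode="AP": stay available and return whatever the node currently holds
               (sacrifices Consistency).
    """
    if partitioned and mode == "CP":
        return "Error: system unavailable"
    return node.data

# Node2 still holds the old value "x"; the update to "y" never replicated.
node2 = Node("x")
print(read(node2, partitioned=True, mode="CP"))  # Error: system unavailable
print(read(node2, partitioned=True, mode="AP"))  # x  (stale data)
```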

4. The difference between Hadoop 1.x and 2.x

1) Architecture improvements: the core components MapReduce and HDFS were redesigned (YARN separates resource management from computation; HDFS gained HA and Federation).
2) Richer ecosystem: new components such as Pig, Tez, Spark, and Kafka were added.

5. Introduce MapReduce (*3)

Answer: MapReduce is a computing model for processing massive data sets. It uses the idea of divide and conquer: the data is first decomposed and processed in parallel (Map), and the intermediate results are then merged into the final result (Reduce).
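The divide-and-conquer flow can be illustrated with the classic word count, simulated here in plain Python (the function names `map_phase` and `reduce_phase` are illustrative; a real Hadoop job would implement `Mapper` and `Reducer` classes in Java):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: decompose each input record into (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group values by key, then merge each group
    # into the final result.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

lines = ["big data big compute", "big data"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result)  # {'big': 3, 'data': 2, 'compute': 1}
```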

Follow-up 1: Why is there a combiner in MapReduce? What are the benefits? Are there any restrictions on use?

Answer: A Combiner performs a local aggregation of each map task's output before the shuffle, reducing the amount of data sent over the network. Restriction: it is only safe when the operation is commutative and associative (e.g. sum, max), because the framework may run the combiner zero or more times and the final result must not change; it cannot be used directly for operations like averaging.
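A small sketch of the benefit (illustrative Python, not the Hadoop API): without a combiner every (word, 1) pair from a map task crosses the network; with one, the pairs are pre-summed locally first. This is only valid because addition is commutative and associative.

```python
from collections import Counter

# Output of one map task before the shuffle.
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Without a combiner: all 4 pairs are shuffled to the reducer.
shuffled_without = map_output

# With a combiner: the map task pre-sums its own output, so only
# one pair per distinct key crosses the network.
shuffled_with = list(Counter(w for w, _ in map_output).items())

print(len(shuffled_without))  # 4
print(shuffled_with)          # [('big', 3), ('data', 1)]
```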

Follow-up 2: Walk through the process of joining two tables with MapReduce.

Answer:
1) Reduce-side join: in the map phase, the map function reads both files, file1 and file2. To distinguish key/value pairs from the two sources, each record is tagged with its origin; in the reduce phase, records sharing the same key are combined to produce the joined output.
2) Map-side join: suited to joining a large table with a small table. Because the small table fits in memory, it is replicated to every map task and loaded into a hash table; each map task then scans only the large table, and for every record looks up its key in the hash table, emitting the joined record when a match is found.
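The reduce-side join described in 1) can be sketched in plain Python (the table contents and variable names are illustrative only):

```python
from collections import defaultdict

# file1: (user_id, name); file2: (user_id, order)
file1 = [(1, "alice"), (2, "bob")]
file2 = [(1, "book"), (1, "pen"), (2, "mug")]

# Map phase: tag every record with its source so the reducer
# can tell the two tables apart.
tagged = [(k, ("file1", v)) for k, v in file1] + \
         [(k, ("file2", v)) for k, v in file2]

# Shuffle: group records by join key.
groups = defaultdict(list)
for key, tagged_value in tagged:
    groups[key].append(tagged_value)

# Reduce phase: for each key, cross the records from the two sources.
joined = []
for key, records in sorted(groups.items()):
    left = [v for tag, v in records if tag == "file1"]
    right = [v for tag, v in records if tag == "file2"]
    joined += [(key, l, r) for l in left for r in right]

print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```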

Follow-up 3: Knock the code: use mr to achieve top10

Answer: You can customize a GroupingComparator so that records are sorted in descending order, and then have the reduce output only the first 10 records.

Code: to be added……

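As a hedged stand-in while the real code is pending, here is a minimal Python simulation of the same idea: each map task keeps only its local top 10 so little data is shuffled, and a single reducer merges the candidates and emits the final top 10. A real Hadoop job would implement this in Java with a custom comparator; the function names here are illustrative only.

```python
import heapq
import random

def mapper(records):
    # Each map task keeps only its local top 10; the global top 10
    # is guaranteed to be among these candidates.
    return heapq.nlargest(10, records)

def reducer(partials):
    # The single reducer merges all per-mapper candidates and keeps
    # the 10 largest, in descending order.
    merged = [x for part in partials for x in part]
    return heapq.nlargest(10, merged)

random.seed(42)
splits = [[random.randrange(1000) for _ in range(100)] for _ in range(3)]
top10 = reducer(mapper(s) for s in splits)
print(top10)

# Sanity check against a full global sort.
assert top10 == sorted((x for s in splits for x in s), reverse=True)[:10]
```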
6. In HA HDFS, what role does ZooKeeper play, and why is ZooKeeper needed? (*2)

Answer: ZooKeeper guarantees that when the Active NameNode fails, the Standby NameNode is promptly switched to the Active state, which solves the NameNode single point of failure. Each NameNode runs a ZKFailoverController that relies on ZooKeeper for failure detection and leader election.


Origin blog.csdn.net/HeavenDan/article/details/112284642