What do Hadoop, HDFS, MapReduce, HBase, Spark, and YARN do?

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without understanding the low-level details of distribution, making full use of a cluster's power for high-speed computing and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data, which suits applications with large data sets. HDFS relaxes some POSIX requirements and allows streaming access to the data in the file system. The core of the Hadoop framework is the pair HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.

  1. Hadoop 1.0 architecture (figure omitted)
  2. Hadoop 2.0 architecture (figure omitted)

HDFS

HDFS (Hadoop Distributed File System) is designed for data so large that it must span hundreds or thousands of machines, while presenting what looks like a single file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to one file path, but the actual data is stored across many different machines. As a user, you don't need to know any of this, just as on a single machine you don't care about which tracks and sectors a file occupies.
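The idea of one logical namespace over many machines can be sketched in a few lines. This is a toy illustration, not HDFS's real API: `ToyNameNode`, the machine names, and the 4-byte block size are all invented here (real HDFS defaults to 128 MB blocks and replicates each one).

```python
BLOCK_SIZE = 4  # bytes per block; purely illustrative (HDFS default is 128 MB)

class ToyNameNode:
    """Maps a logical path to (machine, block_id) pairs -- the metadata role."""
    def __init__(self, machines):
        self.machines = machines   # machine name -> {block_id: bytes}
        self.namespace = {}        # logical path -> [(machine, block_id), ...]

    def write(self, path, data):
        locations = []
        names = list(self.machines)
        for i in range(0, len(data), BLOCK_SIZE):
            # spread blocks round-robin over the machines
            machine = names[(i // BLOCK_SIZE) % len(names)]
            block_id = f"{path}#blk{i // BLOCK_SIZE}"
            self.machines[machine][block_id] = data[i:i + BLOCK_SIZE]
            locations.append((machine, block_id))
        self.namespace[path] = locations

    def read(self, path):
        # The user asks for one path; blocks are fetched from many machines.
        return b"".join(self.machines[m][b] for m, b in self.namespace[path])

machines = {"node1": {}, "node2": {}, "node3": {}}
nn = ToyNameNode(machines)
nn.write("/hdfs/tmp/file1", b"hello distributed world")
print(nn.read("/hdfs/tmp/file1"))  # caller never sees the block layout
```

The caller only ever deals with the path; which machine holds which block is the system's business.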

MapReduce

Suppose you want to analyze a huge text file stored on something like HDFS, and you want to know the frequency of each word in it. You start a MapReduce job: the Map phase runs on the chunks of the file spread across different machines, and the Reduce phase collects and aggregates the per-machine word counts.
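The word-count example above can be written down directly in the MapReduce style. This is a minimal single-process sketch: the "machines" are just list chunks, whereas a real cluster would run the map tasks in parallel where the blocks live.

```python
from collections import Counter
from itertools import chain

def map_phase(chunk):
    # map: emit a (word, 1) pair for every word in this chunk of the file
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # reduce: sum the counts for each word collected from all mappers
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# one chunk per "machine"
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [map_phase(c) for c in chunks]
result = reduce_phase(chain.from_iterable(mapped))
print(result["the"])  # 3
```

The shape (stateless map, aggregating reduce) is what lets the framework scale the same program from three list chunks to thousands of machines.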

YARN

With so many assorted tools all running on the same cluster, everything needs to share resources and work in an orderly manner, so another important component is the scheduling system; the most popular one is YARN. From an open-source perspective, YARN has to some extent settled the debate over the relative merits of the various computing frameworks. YARN evolved out of Hadoop MapReduce. In the MapReduce era, many people criticized MapReduce as unsuitable for iterative or streaming computation, so frameworks such as Spark and Storm appeared, and their developers compared them with MapReduce on their websites or in papers, advertising how much more advanced and efficient their systems were. After YARN emerged, the picture became clear: MapReduce is just one type of application abstraction running on YARN, and Spark and Storm are essentially the same, merely aimed at different types of applications; there is no question of one being strictly superior, each has its strengths, and they coexist. Moreover, barring surprises, future computing frameworks should also be built on top of YARN. In this way an ecosystem was born, with YARN as the underlying resource-management platform and multiple computing frameworks running on top of it.

Hive, Pig

MapReduce programs are genuinely troublesome to write, so people wanted to simplify the process. It's like having only assembly language: you can do almost anything with it, but it still feels cumbersome, and you want a higher, more abstract layer to describe algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand those to the computing engine, freeing you from tedious MapReduce code to write in simpler, more intuitive languages.
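The appeal is easiest to see side by side with the map/reduce version of word count: one declarative statement replaces a hand-written map and reduce program. As an illustration only, sqlite3 stands in for Hive below; Hive would compile a query of this shape into MapReduce jobs rather than execute it locally.

```python
import sqlite3

# Describe *what* you want in SQL and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
text = "the quick brown fox the lazy dog the fox"
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in text.split()])

# The whole word-count "program" is one GROUP BY.
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC"
).fetchall()
print(rows[0])  # ('the', 3)
```

Whether the engine underneath is MapReduce, Tez, or Spark, the user-facing query stays the same, which is exactly the abstraction Hive sells.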

Tez, Spark

Running Hive on MapReduce means the SQL must first be translated into MapReduce programs, which is very slow. But for data analysis, people always want answers faster. For example, you might want to see how many users stopped on a certain page in the past hour and how long they stayed; for a huge website with massive data, this can take tens of minutes or even many hours. Hence Hive on Tez/Spark and Spark SQL: their design philosophy is that MapReduce is slow, but if the same SQL runs on a new-generation general-purpose computing engine such as Tez or Spark, it runs much faster, and users do not need to maintain two systems.

Storm, Streaming

If I ran a company like Weibo, I wouldn't want to show a hot-blog list updated every 24 hours; I'd want a constantly changing hot list with an update delay within one minute, and the methods above would not be adequate. So another computing model was developed: streaming computation, with Storm the most popular streaming platform. The idea of stream computing is: if you want more real-time updates, why not process the data the moment it streams in? Take word-frequency statistics again: if my data stream arrives word by word, I can count the words as they flow past. Stream computing is very powerful and has essentially no latency, but its shortcoming is inflexibility: you must know in advance what you want to count, because once the data has flowed past, anything you didn't count cannot be recovered. So it is a good thing, but it cannot replace the data warehouses and batch-processing systems above.
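The word-by-word counting idea above can be sketched directly. This is a toy single-process version of what a Storm-style topology does; there is no real message bus, and the class name is invented for illustration.

```python
from collections import Counter

class StreamingWordCount:
    """Update counts as each element flows past, instead of batch-scanning a file."""
    def __init__(self):
        self.counts = Counter()

    def on_word(self, word):
        # Process the element the moment it arrives -- near-zero latency,
        # but only for statistics you decided on in advance.
        self.counts[word] += 1
        return self.counts[word]

stream = ["hot", "topic", "hot", "hot"]   # the data stream, word by word
counter = StreamingWordCount()
for w in stream:
    latest = counter.on_word(w)
print(counter.counts["hot"])  # 3
```

Note the trade-off the text describes: the counter for "hot" is always current, but if you later decide you also wanted bigram counts, the words that already flowed past are gone.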

Hive, HBase

You can roughly think of Hive as a view over files, and HBase as a key-value table with indexes.

  1. Tables in Hive are purely logical: only the table definition, i.e. the table's metadata, exists. Hive itself does not store data; it relies entirely on HDFS and MapReduce. Structured data files are mapped to database tables, full SQL query functionality is provided, and SQL statements are ultimately converted into MapReduce jobs for execution. An HBase table is a physical table, suitable for storing unstructured data.
  2. Hive processes data via MapReduce, which operates in a row-oriented mode; HBase processes data in a column-oriented rather than row-oriented mode, which suits random access to massive data.
  3. HBase tables are sparsely stored, so users can define different columns for different rows; Hive tables are dense: however many columns are defined, every row has data for that fixed set of columns.
  4. Hive uses Hadoop to analyze and process data, and Hadoop is a batch-processing system, so low latency cannot be guaranteed; HBase is a near-real-time system that supports real-time queries.
  5. Hive does not provide row-level updates; it suits batch processing of large append-only data sets (such as logs). Queries against HBase do support row-level updates.
  6. Hive provides a fairly complete SQL implementation and is usually used for mining and analysis of historical data. HBase is not suitable for scenarios requiring joins, multi-level indexes, or complex table relationships.
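The "view over files" versus "indexed key-value table" contrast in points 1–3 can be made concrete. Everything below is a toy illustration: the data, column-family names, and helper functions are invented, and real HBase row keys index into a distributed store rather than a dict.

```python
# "Hive-style": the table is just a view over flat files; every query
# scans the data and is batch-oriented.
log_file = ["alice,login", "bob,login", "alice,logout"]

def hive_style_count(user):
    # full scan of the underlying file for each query
    return sum(1 for line in log_file if line.split(",")[0] == user)

# "HBase-style": a key-value store indexed by row key; reads are
# random-access lookups, and rows are sparse (columns can differ per row).
hbase_table = {
    "alice": {"cf:last_action": "logout", "cf:device": "phone"},
    "bob":   {"cf:last_action": "login"},   # no device column for bob
}

print(hive_style_count("alice"))               # 2       (batch scan)
print(hbase_table["alice"]["cf:last_action"])  # logout  (indexed lookup)
```

The scan is fine for analyzing a day of logs; the indexed lookup is what you need to answer "what did alice just do?" in real time.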

ZooKeeper

ZooKeeper is a distributed, open-source coordination service for distributed applications, an open-source implementation of Google's Chubby, and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization, and group services.
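ZooKeeper's data model is a tree of "znodes" addressed by path, and clients can set watches to be notified of changes, which is how configuration maintenance works in practice. The sketch below is a minimal in-memory stand-in: the class, paths, and one-shot-watch behavior are modeled on ZooKeeper's semantics, but a real client talks to a replicated ensemble over the network.

```python
class ToyZnodeStore:
    """Dict-backed stand-in for ZooKeeper's path-addressed znode tree."""
    def __init__(self):
        self.znodes = {"/": b""}
        self.watchers = {}               # path -> list of one-shot callbacks

    def create(self, path, data=b""):
        self.znodes[path] = data

    def set(self, path, data):
        self.znodes[path] = data
        for cb in self.watchers.pop(path, []):   # fire one-shot watches
            cb(path, data)

    def get(self, path, watch=None):
        if watch is not None:
            self.watchers.setdefault(path, []).append(watch)
        return self.znodes[path]

zk = ToyZnodeStore()
zk.create("/config/db_host", b"10.0.0.1")

seen = []
zk.get("/config/db_host", watch=lambda p, d: seen.append(d))
zk.set("/config/db_host", b"10.0.0.2")   # the watcher is notified
print(seen)  # [b'10.0.0.2']
```

This watch-on-read pattern is why services across a cluster can all pick up a configuration change without polling.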


Origin blog.csdn.net/qq_33431394/article/details/108703551