A Brief Introduction to Hadoop

Hadoop is a foundational platform for distributed data storage and computing, developed and hosted by the Apache Software Foundation. Users can develop distributed programs without understanding the underlying details of distributed systems, and leverage the power of a cluster to achieve reliable storage and fast computation.

1. Big data background and characteristics

What is big data

Gartner, the well-known research and advisory firm, defines "big data" as high-volume, high-growth, and diversified information assets that require new processing models to enable stronger decision-making, insight, and process optimization. The term therefore carries two meanings: one is the massive information assets, that is, the data itself; the other is the new processing model, that is, "big data" processing technology.


Characteristics of big data

IBM summarizes big data with five characteristics (the "5 Vs"): Volume (scale), Velocity (speed), Variety (diversity), Value (low value density), and Veracity (authenticity).

  • Volume (scale): The scale of data collection, storage, and computation is very large, typically measured in TB or PB (1 PB = 1024 TB).
  • Velocity (speed): Data is generated quickly, must be processed quickly, and carries strong timeliness requirements.
  • Variety (diversity): Data comes in many types and from many sources: structured, semi-structured, and unstructured, such as logs, audio and video, images, and geographic information. Different types of data call for different processing techniques.
  • Value (low value density): Information can be collected everywhere, yet within the massive volume only a small fraction has real value. Panning for gold in the sand through data mining is a core problem of the big data era.
  • Veracity (authenticity): As the book Big Data Era points out, big data does not take the shortcut of random analysis (sampling surveys) but analyzes and processes the full data set, and the full data set better reflects the objective reality of things.

2. Common problems in big data processing

Big data processing faces two key problems: computing power and storage.

  1. In terms of computing power, traditional software is deployed on a single computer, and that computer alone provides the computing resources for the software. Even with a service-oriented or microservice architecture deployed across machines, a single computing task still ends up running on a single computer. At the same time, traditional software pulls data over the network to the computer where the program is running for processing, which introduces network transmission latency. This traditional "move data closer to computation" model limits computing power.
  2. In terms of storage, companies usually manage data centrally to reduce costs. As time goes on, once the data generated by a software system can no longer fit on a single computer, distributed storage becomes necessary. Questions then follow: how does the software locate the data it needs, and how can data survive faults in a large cluster when an individual computer fails?

3. Getting started with Hadoop

Hadoop (2.x and above) mainly provides HDFS for data storage, YARN for computing resource scheduling, and the MapReduce framework for distributed computation.
GFS is a scalable distributed file system for large-scale, distributed applications that access large amounts of data; it runs on cheap commodity hardware and provides fault tolerance. HDFS was originally implemented with reference to GFS. HDFS is designed as a distributed file system that runs on general-purpose hardware; it is highly fault-tolerant, provides high-throughput data access, is well suited to large data sets, and mainly solves the problem of data storage and management. MapReduce is a distributed computing framework for parallel processing of big data. The scheduling of cluster resources and the coordination of computing tasks are handled by YARN, and MapReduce runs on the cluster to solve the problem of computing power.


4. Hadoop main modules

Hadoop Common module

The Hadoop Common module is a basic module that provides common utilities for the other modules, such as network communication, authentication, file system abstractions, logging components, and other tools.


Hadoop Distributed File System module

Hadoop Distributed File System is abbreviated as HDFS.
HDFS is well suited to data storage in big data scenarios because it provides data storage and access services with high reliability, high scalability, and high throughput. High reliability is achieved through its replica mechanism. High scalability is achieved by adding machines to the cluster for near-linear expansion. High throughput means that when a file is read, HDFS serves the target data to the application from the replica closest to the node that submitted the task.
The basic principle of HDFS is to split data files into blocks of a specified size (128 MB by default since Hadoop 2.x) and store multiple copies of each block on different computers in the cluster. When a node fails and its data is lost, corresponding replicas still exist on other nodes, so a file read can still return the complete data. The way HDFS splits data and stores redundant copies is invisible to developers; from their point of view, they simply uploaded a file to HDFS.
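As a concrete illustration (not code from the original article), the following minimal Java sketch uploads a local file to HDFS through the FileSystem API. The NameNode address and the file paths are assumptions for the example, and the explicit replication and block-size settings simply restate the defaults described above:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address; in practice this usually comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        // Optional overrides: 3 replicas and 128 MB blocks (the defaults described above).
        conf.set("dfs.replication", "3");
        conf.setLong("dfs.blocksize", 128 * 1024 * 1024L);

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            // The client only "uploads a file"; splitting it into blocks and placing
            // replicas on DataNodes is handled transparently by HDFS.
            fs.copyFromLocalFile(new Path("/tmp/local-data.log"),
                                 new Path("/data/raw/local-data.log"));
        }
    }
}
```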
HDFS consists of three daemons: NameNode, SecondaryNameNode, and DataNode. The DataNode stores the actual content of data files, while the NameNode records the metadata of those files. The more files stored on the DataNodes, the more metadata the NameNode has to keep, which slows down NameNode loading. Therefore, HDFS is not suitable for storing a large number of small files; it is better suited to a smaller number of large files.
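To see this split between metadata and content, a hedged sketch like the one below asks for the block locations of a file (the cluster address and path are again assumptions): the block-to-host mapping is metadata kept by the NameNode, while the listed hosts are the DataNodes that actually hold each block's bytes.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            Path file = new Path("/data/raw/local-data.log"); // assumed path
            FileStatus status = fs.getFileStatus(file);
            // Ask for the block layout of the whole file: answered from NameNode metadata.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```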


Hadoop YARN module

YARN stands for Yet Another Resource Negotiator. YARN was split out of the resource management and job scheduling functions of the JobTracker in Hadoop 1.x, with the goal of solving the limitation that Hadoop 1.x could only run MapReduce programs.
As Hadoop has evolved, YARN has become a general-purpose resource scheduling framework, giving Hadoop great advantages in unified cluster resource management and data sharing. Different types of jobs can also run on YARN, such as the very popular Spark and Tez.
YARN consists of two daemons: the ResourceManager and the NodeManager. The ResourceManager is mainly responsible for resource scheduling and for monitoring the running status of applications and the health of the cluster. The NodeManager is responsible for actually executing tasks, monitoring their resource usage, and regularly reporting its status to the ResourceManager.
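As an illustrative sketch (assuming a yarn-site.xml with the ResourceManager address is available on the classpath), the YarnClient API can be used to ask the ResourceManager for the applications it is tracking, along with their state and progress:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppsSketch {
    public static void main(String[] args) throws Exception {
        // Loads yarn-site.xml / core-site.xml from the classpath if present.
        Configuration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // The ResourceManager tracks every application's state and progress.
            for (ApplicationReport app : yarnClient.getApplications()) {
                System.out.println(app.getApplicationId()
                        + " " + app.getName()
                        + " " + app.getYarnApplicationState()
                        + " progress=" + app.getProgress());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```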


Hadoop MapReduce module

MapReduce is both a parallel computing framework and a programming model, mainly used for processing large-scale data sets. Its name reflects its two operations: Map (mapping) and Reduce (reduction). Its advantage is that even programmers with no experience in developing distributed applications can quickly write distributed programs through the MapReduce interfaces, without having to worry about the underlying details of parallel computing.
By default, a MapReduce job divides the input data into multiple independent splits, and each split generates one Map task; the Map tasks read and execute in parallel. Map outputs data in the form of key/value pairs. During output, the keys are hashed and sorted into partitions, with all values for the same key landing in the same partition. An input split is generally the same size as an HDFS block, so under normal circumstances a split does not span machines, which avoids moving the data a Map task reads between machine nodes.
When executing a MapReduce task, the system distributes the MapReduce code to the node holding the target data, or to the nearest available node, thereby avoiding or minimizing network transfer of data. The underlying idea is that moving computation is usually cheaper than moving data, which is the "move computation closer to data" model.
After a Map task finishes, its output can be pre-merged by setting a Combiner, reducing the volume of intermediate Map results. After the Map (or Combiner) stage, the Reduce stage processes the intermediate results produced by the different tasks, performs the final computation, and outputs the results the user wants.
Under normal circumstances, the intermediate results output by Map are written to local disk rather than HDFS, which reduces the file management burden on HDFS. After the Reduce tasks finish, the intermediate data is deleted and the final output is stored on HDFS.
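The classic WordCount job from the Hadoop documentation ties these pieces together: the Mapper emits (word, 1) pairs, a Combiner pre-merges the Map output, and the Reducer sums the counts. The input and output paths are taken from the command line and would typically point at HDFS directories:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: called once per input record, emits (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-merge Map output, as described above
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final results land on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same reducer class doubles as the Combiner here because summing partial counts and summing full counts are the same operation, which is exactly the kind of early merging described above.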


Hadoop Ozone module

Before Hadoop 3.x, users could only use HDFS for storage, but HDFS's design is not well suited to storing small files; Ozone was created to solve this problem. Ozone relies on HDDS (Hadoop Distributed Data Store) and uses a key-object model to store small files and read them quickly.


Source: blog.csdn.net/qq_43965708/article/details/112604531