Hadoop big data platform fundamentals (1)

1. What is Hadoop?

Hadoop is an open-source distributed computing platform built around two core components: the HDFS file system (Hadoop Distributed File System) and the MapReduce computing framework.

HDFS is designed to let users build a distributed storage system on clusters of interconnected machines. MapReduce lets developers write concurrent, distributed programs without having to worry too much about the underlying details of the computing framework.

 

2. The role of Hadoop

Hadoop is used to process massive amounts of data. For example, Yahoo uses it for its web search and advertising systems; Baidu uses it for search log analysis and web data mining; Alibaba uses it to store massive volumes of transaction data; China Mobile uses it for data analysis.

 

3. Advantages of Hadoop

1. High reliability: data is processed correctly and dependably.

2. High scalability: cluster nodes can be added or removed at any time.

3. Efficiency: massive data sets are processed quickly.

4. High fault tolerance: the failure of a single compute node does not affect the final result.

 

4. Hadoop project structure diagram

Hadoop project:

Management and coordination: Ambari is a web-based tool for provisioning, managing and monitoring Hadoop clusters; ZooKeeper provides coordination and failover services.

High-level languages: Pig and Hive provide data statistics and query capabilities; Pig is used for ETL processing and data modeling, Hive for analytical queries.

Core computing engine: MapReduce, the underlying distributed processing framework.

Underlying storage database: HBase, a highly reliable, high-performance, column-oriented, scalable distributed structured database; HCatalog provides the metadata service.

HDFS: the physical storage file system.

 

5. Hadoop architecture

1. HDFS file system architecture

HDFS uses a Master/Slave architecture. The NameNode acts as the master server, managing the file system namespace and clients' access to files, while the DataNodes manage the data stored on their nodes. Typically the master node runs the NameNode and each of the remaining slave nodes runs a DataNode.
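
As an illustration of this interaction, here is a minimal sketch of a client reading a file through the standard HDFS Java API (the class name and path are only illustrative): the call to open() consults the NameNode for block locations, and the bytes are then streamed from the DataNodes.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Read a file from HDFS: the client asks the NameNode where the file's blocks
// live, then streams the data from the DataNodes that hold those blocks.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = fs.open(new Path(args[0]))) { // e.g. a path like /user/demo/input.txt
            IOUtils.copyBytes(in, System.out, 4096, false); // copy file contents to stdout
        }
    }
}
```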

 

2. MapReduce computing architecture

In essence, MapReduce is a simple parallel computing framework. It also adopts the master/slave model: a single master node runs the JobTracker and N slave nodes each run a TaskTracker. The JobTracker manages and schedules jobs, while the TaskTrackers execute the individual tasks.

 1) Execution process: each computing job is divided into two stages, the Map stage and the Reduce stage

  The Map stage accepts a set of inputs in key-value pair form <key, value> and generates intermediate outputs that are also key-value pairs <key_m, value_m>; the Reduce stage receives the intermediate outputs <key_m, value_m> generated by Map, processes them, and emits the final results.

Example: take the simplest WordCount program, which counts the number of occurrences of each word in a text. Each map task extracts the words from its portion of the text and generates n <word, 1> intermediate outputs; the Reduce task then processes these intermediate outputs and converts them into the final <word, n> output.

During this process, the intermediate output generated by Map is stored directly on local disk and is deleted once the job completes; the final output generated by Reduce is stored in HDFS.
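
As a sketch of this flow, the Map and Reduce stages of WordCount can be written against the standard Hadoop MapReduce Java API roughly as follows (class names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit <word, 1> for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // intermediate output <word, 1>
        }
    }
}

// Reduce stage: sum the 1s for each word and emit <word, n>.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));     // final output <word, n>
    }
}
```

The <word, 1> pairs emitted by the mapper are the intermediate output kept on local disk; the <word, n> pairs written by the reducer end up in HDFS.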

 

 2) How Hadoop Streaming Works

The executable that implements Map reads its input from standard input: the framework converts the input into lines and feeds them to the program's stdin. The framework then collects the lines the program writes to standard output and turns each line into a key-value pair: the content before the first tab character of a line is the key and the content after it is the value; if a line contains no tab character, the entire line is used as the key and the value is empty.
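
A streaming mapper is therefore just an executable that follows this line/tab convention. A minimal word-count mapper sketch (streaming programs can be written in any language; this sketch uses Java):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Streaming-style mapper: read lines from stdin, emit "word<TAB>1" per word on stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // key and value are separated by the first tab character
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```

Such an executable would typically be launched through the hadoop-streaming jar (with -input, -output, -mapper and -reducer options); the framework handles the line feeding and tab-based key/value parsing described above.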

 

 3) Data flow and control flow of MapReduce

 The JobTracker (master) first schedules and assigns the Map tasks. In the Map phase, each worker reads from the input files, processes its split, and temporarily stores the results as intermediate files on local disk, then notifies the master (JobTracker) that its map work is done. The master then starts the Reduce phase: the shuffle step distributes the intermediate results among the reduce workers according to the partitioning rules, each worker processes the tasks assigned to it, and the results are written to the output files.

  <1> The master (JobTracker) is responsible for assigning tasks to the workers (TaskTrackers) under it

  <2> While executing, each worker regularly reports its progress and current stage back to the master, which records and manages that progress

  <3> If a worker fails, the master reassigns its failed tasks to another worker

 

6. WordCount word statistics program

1. Count how often each word appears. The output is sorted alphabetically by word.

2. For this example, based on the principles above, we can see:

 (1) In the Map stage, each node carries out the work from reading the input data to splitting out the words and collecting them, and sorts its output automatically

 (2) The shuffle phase groups identical words together and distributes them to the Reduce workers (shuffle is a default MapReduce step that runs once the Map phase has completed)

 (3) The Reduce phase receives the grouped words and produces the final counts

 

By default, the intermediate key-value pairs passed through the shuffle phase are distributed among the Reduce-phase workers. Although they are sorted by default during the Map phase, this sorting only holds within each individual worker's node; the overall Reduce output is still globally unordered. To make the overall output ordered, the logic that assigns keys to Reduce nodes must be rewritten, i.e. the Partition method must be overridden:

  Before the reducers receive their work after the shuffle phase, the key of each intermediate map output pair is used to decide which partition interval, i.e. which Reduce worker node, that pair is sent to. When the final result must be globally ordered, it is enough for the partitioning decision to assign keys to reducers in key order.
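
A minimal sketch of such an order-preserving partitioner for the WordCount case (the bucketing-by-first-letter scheme and the class name are illustrative assumptions):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Range partitioner: lower keys go to lower-numbered reduce partitions, so the
// reducer outputs, concatenated in partition order, form a globally sorted result.
public class AlphabeticPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        if (first < 'a') {
            return 0;                        // digits and punctuation sort first
        }
        if (first > 'z') {
            return numPartitions - 1;
        }
        // Spread 'a'..'z' evenly across the available reduce partitions.
        return (first - 'a') * numPartitions / 26;
    }
}
```

It would be registered in the job driver with job.setPartitionerClass(AlphabeticPartitioner.class); see the driver sketch in the next section.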

 

7. Map/Reduce execution process

The overall execution process can be summarized as: code writing -> job configuration -> job submission -> Map task assignment and execution -> shuffle of intermediate results -> Reduce task assignment and execution -> job completion.
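
As a sketch of the job-configuration and job-submission steps, a driver that assumes the mapper, reducer and partitioner classes sketched above might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Job configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setPartitionerClass(AlphabeticPartitioner.class); // optional: globally ordered output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path in HDFS

        // Job submission: Map, shuffle and Reduce then run on the cluster
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```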
