What is Hadoop? Please briefly explain its architecture and components.

Hadoop is an open-source distributed computing framework for storing and processing large-scale data sets. It is designed to run on inexpensive commodity hardware, tolerate hardware failures, and scale out to handle very large volumes of data. Hadoop's architecture consists of two core components: the Hadoop Distributed File System (HDFS) and MapReduce.

  1. Hadoop Distributed File System (HDFS):
    HDFS is Hadoop's distributed file system, used to store large-scale data sets on inexpensive commodity hardware while tolerating node failures. HDFS splits large files into blocks and stores replicas of each block across multiple machines, which provides both data reliability and scalability. The architecture of HDFS includes the following components (a short client-side sketch using the Java FileSystem API follows the component list):
  • NameNode: NameNode is the master node of HDFS and is responsible for managing the namespace of the file system and the metadata of storage blocks. It maintains the directory tree and file block location information of the entire file system.
  • DataNode: DataNode is the working node of HDFS and is responsible for storing actual data blocks. It receives instructions from the NameNode and manages locally stored data blocks. DataNode is also responsible for data replication and fault tolerance.
  • Secondary NameNode: The Secondary NameNode is an auxiliary node for the NameNode. It periodically merges the NameNode's edit log with the file system image (fsimage) to produce a new checkpoint image. This keeps the edit log from growing without bound and reduces the load on the NameNode; it is a checkpointing helper, not a hot standby for the NameNode.
  2. MapReduce:
    MapReduce is Hadoop's computing model and execution framework. It divides a computing job into two stages: a Map stage and a Reduce stage. The architecture of classic MapReduce (MRv1) includes the following components (in Hadoop 2.x and later, YARN's ResourceManager and NodeManager take over the roles of the JobTracker and TaskTracker):
  • JobTracker: JobTracker is the master node of MapReduce and is responsible for scheduling and monitoring job execution. It receives job requests from clients and assigns jobs to available TaskTrackers for execution.
  • TaskTracker: TaskTracker is the working node of MapReduce and is responsible for performing specific tasks. It receives instructions from JobTracker and runs Map tasks and Reduce tasks. TaskTracker is also responsible for monitoring the progress and status of tasks and reporting the results to JobTracker.
  • Map task: The Map task is the first stage of MapReduce. Each Map task independently processes one split of the input data, converts its records into key-value pairs, and emits intermediate key-value pairs.
  • Reduce task: The Reduce task is the second stage of MapReduce. Each Reduce task receives, for its assigned partition of keys, the intermediate values produced by all Map tasks (collected during the shuffle phase), aggregates them, and writes the final result.
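
Before looking at MapReduce code, here is a minimal sketch of how a client program writes and reads a file on HDFS through Hadoop's Java FileSystem API. The NameNode address hdfs://localhost:9000 and the path /demo/hello.txt are assumptions for illustration and should be adjusted for a real cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {

  public static void main(String[] args) throws Exception {
    // Point the client at the NameNode (assumed address; change for your cluster).
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode allocates blocks and records metadata; the data itself
    // is streamed to DataNodes, which replicate it among themselves.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\nhello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode's metadata; the bytes are
    // read from the DataNodes that hold the replicas.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}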

The following is example code that uses Hadoop's MapReduce framework to count the occurrences of each word in an input text file:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper class
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer class
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In the example above, we define a Java class called WordCount. It contains a Mapper class (TokenizerMapper) and a Reducer class (IntSumReducer). The Mapper splits each line of the input text into words and emits each word as a key with the value 1. The Reducer sums the counts for each word and writes out the total.
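
For example, given a small (assumed) input containing the single line "hello world hello", the job's data flow would look roughly like this:

  Map output:        (hello, 1), (world, 1), (hello, 1)
  Shuffle and sort:  hello -> [1, 1], world -> [1]
  Reduce output:     (hello, 2), (world, 1)

Because IntSumReducer is also registered as a combiner, counts may be partially summed on the map side before the shuffle, which reduces network traffic without changing the final result.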

In the main() method, we create a Job object and set the job name, the Mapper, Combiner, and Reducer classes, and the output key and value types. We also specify the input and output paths and call job.waitForCompletion() to submit the job and wait for it to finish.
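
Once compiled and packaged into a jar, the job is typically submitted with the hadoop command; the jar name and HDFS paths below are placeholders:

  hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
  hdfs dfs -cat /user/hadoop/output/part-r-00000

Note that the output directory must not already exist; the job creates it and writes one part-r-* file per Reduce task.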

With appropriate input data and custom Mapper and Reducer classes, the same pattern can be applied to many kinds of large-scale data analysis. By combining the distributed file system HDFS for storage with the MapReduce framework for computation, we can build a highly reliable and scalable big data processing system.
