What are the core components of Hadoop? Please briefly describe their role.

Hadoop is an open-source distributed computing framework for processing large-scale data sets. Its core components are the Hadoop Distributed File System (HDFS) and the MapReduce execution framework. Below is a brief description of the role of each.

  1. Hadoop Distributed File System (HDFS):

    • HDFS is Hadoop's storage layer for large-scale data sets. It is a distributed file system that spreads data across the machines in a cluster and provides high reliability and fault tolerance.
    • HDFS splits large files into fixed-size blocks and distributes these blocks across different machines in the cluster. Each block is stored in multiple replicas (three by default) to provide redundancy and fault tolerance.
    • Hadoop achieves data locality by scheduling computation on the nodes that already hold the relevant data blocks, rather than shipping data across the network, which improves the efficiency of data access.
    • HDFS provides high-throughput data access and is well suited to batch processing and large-scale data analysis (a short sketch of reading and writing files on HDFS through its Java API follows this list).
  2. MapReduce execution framework:

    • MapReduce is Hadoop's computing framework for processing and analyzing large-scale data sets. It divides a computation into two phases: the Map phase and the Reduce phase.
    • The Map phase converts the input data into key-value pairs and produces intermediate results. Each Map task independently processes a subset (split) of the input data and emits its own intermediate results.
    • The Reduce phase aggregates the intermediate results and produces the final output. Each Reduce task processes the intermediate results, grouped by key, generated by one or more Map tasks.
    • The MapReduce execution framework automatically handles details such as task allocation, scheduling, fault tolerance, and data transmission, allowing developers to focus on writing business logic.
    • The MapReduce execution framework is highly scalable and fault-tolerant, can handle large-scale data sets, and automatically re-execute tasks when computing nodes fail.

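To make the HDFS side concrete, here is a minimal sketch of reading and writing a file on HDFS through Hadoop's FileSystem API. The path /tmp/hdfs-demo.txt and the address hdfs://localhost:9000 are assumptions for illustration; in practice the file system address usually comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address for illustration; normally taken from core-site.xml
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hdfs-demo.txt");

    // Write a small file; HDFS splits larger files into blocks and replicates them
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back line by line
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}
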
The following is an example that uses Hadoop's MapReduce framework to count the occurrences of each word in an input text file (the classic WordCount program):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) pairs
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Split the line into tokens and emit each token with a count of 1
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // Sum all counts received for this word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In the above example, we define a Java class called WordCount that contains a Mapper class (TokenizerMapper) and a Reducer class (IntSumReducer). The Mapper splits the input text into words and emits each word as a key with the value 1. The Reducer sums the counts for each word and writes out the result.

In the main() method, we create a Job object and set the job name, the Mapper, Combiner, and Reducer classes, and the output key and value types. We also specify the input and output paths and call job.waitForCompletion() to run the job and exit with an appropriate status code.

With appropriate input data and custom Mapper and Reducer classes, we can handle various types of large-scale data and perform corresponding analysis and calculations. Using Hadoop's distributed file system HDFS and computing framework MapReduce, we can build a highly reliable and scalable big data processing system.
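
To illustrate the point about custom Mapper and Reducer classes, here is a minimal sketch of a job that sums an amount per category from CSV-style input lines of the form "category,amount". The class name CategorySum and the input layout are assumptions for illustration; the driver setup would mirror the main() method shown above.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CategorySum {

  // Mapper: parses "category,amount" lines and emits (category, amount)
  public static class ParseMapper
       extends Mapper<Object, Text, Text, LongWritable> {

    private final Text category = new Text();
    private final LongWritable amount = new LongWritable();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length != 2) {
        return; // skip malformed lines
      }
      try {
        amount.set(Long.parseLong(fields[1].trim()));
      } catch (NumberFormatException e) {
        return; // skip lines whose amount is not a number
      }
      category.set(fields[0].trim());
      context.write(category, amount);
    }
  }

  // Reducer: sums all amounts for each category
  public static class SumReducer
       extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable total = new LongWritable();

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable val : values) {
        sum += val.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }
}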
