Hadoop summary

HDFS cluster

An HDFS cluster is built on top of a Hadoop cluster. Since the HDFS daemons are the core processes of Hadoop, configuring an HDFS cluster is representative of configuring a Hadoop cluster as a whole.

Using Docker makes it more convenient and efficient to build a cluster environment.

Configuration in each computer

How is a Hadoop cluster configured, and what configuration should live on each computer?

The HDFS NameNode controls the DataNodes remotely over SSH, so the key configuration items belong on the NameNode, while node-specific configuration items belong on each DataNode. In other words, the configuration of the NameNode and the DataNodes can differ, and different DataNodes can also be configured differently.

To make it easier to set up the cluster, I distribute the same configuration files to all cluster nodes in the form of a Docker image.
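
Passwordless SSH from the NameNode to the DataNodes is what makes this remote control work. Since every node is created from the same image, one key pair prepared in the prototype is enough; a minimal sketch, assuming the hadoop user and the default key paths (these steps are not part of this article's own procedure):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys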

Specific steps

The general idea is: first configure a Hadoop image so that it can be shared by all nodes in the cluster, and then use it as a prototype to generate several containers that together form the cluster.

Configure the prototype

First, start a container from the previously prepared hadoop_proto image:

docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init

Enter the configuration file directory of Hadoop:

cd $HADOOP_HOME/etc/hadoop

The main configuration files in this directory and their purposes:

workers: the hostnames or IP addresses of all DataNodes
core-site.xml: Hadoop core configuration items
hdfs-site.xml: HDFS configuration items
mapred-site.xml: MapReduce configuration items
yarn-site.xml: YARN configuration items
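
As an illustration only, minimal contents consistent with the host names used later in this article (nn for the NameNode, dn1 and dn2 for the DataNodes) might look like the following; the port and replication values are assumptions, not the article's actual configuration.

workers:

dn1
dn2

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>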

After the cluster prototype is configured, exit the container, stop it, and commit it as a new image named cluster_proto:

docker stop hadoop_temp
docker commit hadoop_temp cluster_proto

Deploy the cluster

First, create the private network hnet for the Hadoop cluster:

docker network create --subnet=172.20.0.0/16 hnet

Next, create the cluster containers:

docker run -d --name=nn --hostname=nn --network=hnet --ip=172.20.1.0 --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 --privileged cluster_proto /usr/sbin/init
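
Before continuing, you can confirm that all three containers are running:

docker ps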

Enter the NameNode container:

docker exec -it nn su hadoop

Format HDFS:

hdfs namenode -format

If nothing goes wrong, then the next step is to start HDFS:

start-dfs.sh

After a successful start, the jps command should show the NameNode and SecondaryNameNode processes. There is no DataNode process on the NameNode itself, because that process runs inside dn1 and dn2.
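
For example, on the NameNode:

jps

Each line of output is a process id followed by a class name; here the list should include NameNode and SecondaryNameNode (plus Jps itself).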

Using MapReduce

Word Count is the most classic of MapReduce programs. Its task is to scan a text file and count the total number of occurrences of each word that appears in it.

Hadoop includes many classic MapReduce sample programs, including Word Count.

Note: this example can run even when HDFS is not running, so let's test it in stand-alone mode first.

First, start a new container from the previously built hadoop_proto image:

docker run -d --name=word_count --privileged hadoop_proto /usr/sbin/init

Enter the container:

docker exec -it word_count bash

Enter the HOME directory:

cd ~

Now we prepare a text file input.txt:

I love China
I like China
I love hadoop
I like hadoop

Save the above content with a text editor.
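
If you prefer to create the file from the shell instead of an editor, a here-document does the same thing:

cat > input.txt << 'EOF'
I love China
I like China
I love hadoop
I like hadoop
EOF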

Execute MapReduce:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar wordcount input.txt output

The meaning of this command:

hadoop jar runs a MapReduce job from a jar file; the argument that follows is the path to the examples package.

wordcount selects the Word Count program inside the examples package; the two arguments after it are the input file and the name of the output directory (a directory, because the output consists of multiple files).

After execution, an output directory should be created containing two files: _SUCCESS and part-r-00000.
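
The result can be inspected directly; the counts should match those shown in the cluster-mode run later in this article:

cat output/part-r-00000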

Cluster mode

Now we run MapReduce in cluster mode.

Start the cluster containers configured in the previous section:

docker start nn dn1 dn2

Enter the NameNode container:

docker exec -it nn su hadoop

Go to HOME:

cd ~

Edit input.txt:

I love China
I like China
I love hadoop
I like hadoop

Start HDFS:

start-dfs.sh

Create the input directory on HDFS:

hadoop fs -mkdir /wordcount
hadoop fs -mkdir /wordcount/input

Upload input.txt:

hadoop fs -put input.txt /wordcount/input/
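
Optionally, confirm that the file was uploaded:

hadoop fs -ls /wordcount/input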

Execute Word Count:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar wordcount /wordcount/input /wordcount/output

View execution results:

hadoop fs -cat /wordcount/output/part-r-00000

If everything is fine, the following results will be displayed:

I       4
hadoop  2
like    2
love    2
China   2

MapReduce programming

Having learned how to run MapReduce, you can already handle statistical and retrieval tasks such as Word Count, but MapReduce can do far more than that.

MapReduce mainly relies on developers to implement the functionality themselves: data is processed by implementing the Map- and Reduce-related methods.

To demonstrate this process simply, we will write a Word Count program by hand.

Note: MapReduce programs depend on the Hadoop libraries, but the Hadoop runtime used here is a Docker container, in which it is inconvenient to set up a development environment, so real development work (including debugging) will require a computer with Hadoop installed.

The code of MyWordCount.java:

/**
 * Attribution:
 * This program is adapted from http://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html
 */
package com.runoob.hadoop;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
/**
 * The Map-related class
 */
class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();
   public void map(LongWritable key,
               Text value,
               OutputCollector<Text, IntWritable> output,
               Reporter reporter)
         throws IOException {
      // Split the line into tokens and emit (word, 1) for each token
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         output.collect(word, one);
      }
   }
}
/**
 * The Reduce-related class
 */
class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
   public void reduce(Text key,
                  Iterator<IntWritable> values,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter)
         throws IOException {
      // Sum the counts for each word and emit (word, total)
      int sum = 0;
      while (values.hasNext()) {
         sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
   }
}
public class MyWordCount {
   public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MyWordCount.class);
      conf.setJobName("my_word_count");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      // The first argument is the input path
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      // The second argument is the output path
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}

Save the contents of this Java file to the NameNode container, suggested location:

/home/hadoop/MyWordCount/com/runoob/hadoop/MyWordCount.java
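
One way to create that directory structure (the path simply mirrors the Java package name):

mkdir -p /home/hadoop/MyWordCount/com/runoob/hadoop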

Note: the JDK installed in some Docker environments does not handle non-ASCII characters well, so to be on the safe side, keep the comments in the code ASCII-only.

Enter the directory:

cd /home/hadoop/MyWordCount

Compile:

javac -classpath ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.4.jar:${HADOOP_HOME}/share/hadoop/client/hadoop-client-api-3.1.4.jar com/runoob/hadoop/MyWordCount.java

Package:

jar -cf my-word-count.jar com

Execute:

hadoop jar my-word-count.jar com.runoob.hadoop.MyWordCount /wordcount/input /wordcount/output2

View the results:

hadoop fs -cat /wordcount/output2/part-00000

Output:

I       4
hadoop  2
like    2
love    2
China   2
