HDFS cluster
An HDFS cluster is built on top of a Hadoop cluster. Since the HDFS daemons are the core Hadoop processes, configuring an HDFS cluster is representative of configuring a Hadoop cluster in general.
Docker makes building such a cluster environment considerably more convenient and efficient.
Configuration on each machine
How is a Hadoop cluster configured, and which settings belong on which machine?
The NameNode controls the DataNodes remotely over SSH, so the critical configuration items belong on the NameNode, while node-specific settings live on each DataNode. In other words, the NameNode and the DataNodes may be configured differently, and different DataNodes may also differ from one another.
To keep cluster setup simple, however, I distribute one identical set of configuration files to every node by baking them into a shared Docker image.
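Concretely, this means the NameNode must be able to SSH into every DataNode without a password. A minimal sketch of that setup (assuming a hadoop user on each node):

```shell
# On the NameNode: create an SSH key pair for the hadoop user if none exists
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
# Then install the public key on each DataNode, e.g.:
#   ssh-copy-id hadoop@dn1
#   ssh-copy-id hadoop@dn2
```

Afterwards, ssh dn1 from the NameNode should log in without prompting for a password.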
Specific steps
The general idea: first configure a single Hadoop image so that all cluster nodes can share the result, then use it as a prototype to spawn several containers that together form the cluster.
Configure the prototype
First, start the previously prepared hadoop_proto image as a container:
docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init
Enter the configuration file directory of Hadoop:
cd $HADOOP_HOME/etc/hadoop
The configuration files to edit here are:

| File | Purpose |
|---|---|
| workers | Lists the hostname or IP address of every DataNode |
| core-site.xml | Hadoop core configuration |
| hdfs-site.xml | HDFS configuration |
| mapred-site.xml | MapReduce configuration |
| yarn-site.xml | YARN configuration |
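As an illustration only (the exact values depend on the image's existing settings), a minimal workers file and core-site.xml might look like the following, assuming the hostname nn and HDFS port 9000 used later in this chapter. Run inside $HADOOP_HOME/etc/hadoop:

```shell
# workers: one DataNode hostname per line
cat > workers <<'EOF'
dn1
dn2
EOF

# core-site.xml: point the default filesystem at the NameNode
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
  </property>
</configuration>
EOF
```

The hostnames here must match the --hostname values of the containers created below.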
After the cluster prototype is configured, exit and stop the container, then commit it as a new image named cluster_proto:
docker stop hadoop_temp
docker commit hadoop_temp cluster_proto
Deploy the cluster
First, create the private network hnet for the Hadoop cluster:
docker network create --subnet=172.20.0.0/16 hnet
Next create the cluster container:
docker run -d --name=nn --hostname=nn --network=hnet --ip=172.20.1.0 --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 --privileged cluster_proto /usr/sbin/init
Enter the NameNode container:
docker exec -it nn su hadoop
Format HDFS:
hdfs namenode -format
If nothing goes wrong, then the next step is to start HDFS:
start-dfs.sh
After a successful start, the jps command should show NameNode and SecondaryNameNode processes. There is no DataNode process on the NameNode, because that process runs in dn1 and dn2.
Using MapReduce
Word Count ("word count") is the classic introductory MapReduce program. Its task is to tally the words in a text file, counting the total number of occurrences of each word that appears.
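Conceptually, the map phase splits text into words and the reduce phase sums the counts per word. The same idea can be sketched with ordinary Unix tools (a rough analogy only, not how Hadoop actually runs it):

```shell
# "map": emit one word per line; "shuffle": sort groups identical words together;
# "reduce": uniq -c counts each group
printf 'I love China\nI like China\nI love hadoop\nI like hadoop\n' \
  | tr -s ' ' '\n' \
  | LC_ALL=C sort \
  | uniq -c
```

Each output line shows a count followed by a word, e.g. 4 for I and 2 for hadoop.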
Hadoop includes many classic MapReduce sample programs, including Word Count.
Note: this example also runs without HDFS, so let's first test it in stand-alone mode.
First, start a new container from the previously built hadoop_proto image:
docker run -d --name=word_count --privileged hadoop_proto /usr/sbin/init
Enter the container:
docker exec -it word_count bash
Enter the HOME directory:
cd ~
Now we prepare a text file input.txt:
I love China
I like China
I love hadoop
I like hadoop
Save the above content to input.txt with a text editor.
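Alternatively, the file can be created directly from the shell:

```shell
# write the four sample lines into input.txt
cat > input.txt <<'EOF'
I love China
I like China
I love hadoop
I like hadoop
EOF
```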
Execute MapReduce:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar wordcount input.txt output
What the command means:
hadoop jar runs a MapReduce job from a jar file; the first argument is the path to the examples jar.
wordcount selects the Word Count program inside that jar; it is followed by two parameters: the input file and the name of the output directory (a directory, because the result consists of multiple files).
After execution, an output directory should be produced, containing two files: _SUCCESS and part-r-00000.
Cluster mode
Now we run MapReduce in cluster mode.
Start the cluster container configured in the previous chapter:
docker start nn dn1 dn2
Enter the NameNode container:
docker exec -it nn su hadoop
Go to HOME:
cd ~
Edit input.txt:
I love China
I like China
I love hadoop
I like hadoop
Start HDFS:
start-dfs.sh
Create the directories:
hadoop fs -mkdir /wordcount
hadoop fs -mkdir /wordcount/input
Upload input.txt:
hadoop fs -put input.txt /wordcount/input/
Execute Word Count:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar wordcount /wordcount/input /wordcount/output
View execution results:
hadoop fs -cat /wordcount/output/part-r-00000
If everything is fine, the following results will be displayed (sorted by key, since MapReduce sorts reducer output):
China 2
I 4
hadoop 2
like 2
love 2
MapReduce programming
After learning how to run MapReduce jobs, you can already handle statistics and retrieval tasks such as Word Count, but that barely scratches the surface of what MapReduce can do.
MapReduce mainly relies on the developer to implement the logic: you process data by implementing the Map- and Reduce-related methods.
In order to demonstrate this process simply, we manually write a Word Count program.
Note: MapReduce depends on the Hadoop libraries. Since my Hadoop runtime is a Docker container, which is awkward to use as a development environment, real development work (including debugging) calls for a computer with Hadoop installed.
The code of MyWordCount.java:
/**
 * Attribution:
 * this program is adapted from http://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html
 */
package com.runoob.hadoop;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

/**
 * The Map-related class
 */
class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key,
                    Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

/**
 * The Reduce-related class
 */
class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key,
                       Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public class MyWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyWordCount.class);
        conf.setJobName("my_word_count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // the first argument is the input path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // the second argument is the output path
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Save the contents of this Java file to the NameNode container, suggested location:
/home/hadoop/MyWordCount/com/runoob/hadoop/MyWordCount.java
Note: some JDK or terminal configurations inside Docker containers do not handle non-ASCII text well, so to be safe, keep the comments in the source file ASCII-only.
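The directory layout must mirror the Java package name com.runoob.hadoop; it can be created with:

```shell
# create the package directory structure under the project root
mkdir -p ~/MyWordCount/com/runoob/hadoop
```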
Enter the directory:
cd /home/hadoop/MyWordCount
Compile:
javac -classpath ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.4.jar:${HADOOP_HOME}/share/hadoop/client/hadoop-client-api-3.1.4.jar com/runoob/hadoop/MyWordCount.java
Package:
jar -cf my-word-count.jar com
Execute:
hadoop jar my-word-count.jar com.runoob.hadoop.MyWordCount /wordcount/input /wordcount/output2
View the results:
hadoop fs -cat /wordcount/output2/part-00000
Output:
China 2
I 4
hadoop 2
like 2
love 2