Hadoop: running the first example, wordcount


A lot has been going on in the past few weeks, and I haven't written a blog post in two weeks. This week I finally got my Hadoop instance up and running and ran the official wordcount example (which counts how many times each word occurs in a file).
Below is my record of a successful run. The prerequisite is that Hadoop is already installed and configured (you can refer to my previous post: hadoop pseudo-distributed installation record).

Operation steps:

1. First, prepare a file containing some words, then upload this file to the Linux server.
File content:

hello world hello hadoop
abc hadoop aabb hello word
count test hdfs mapreduce
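
For example, the file can be created directly on the server like this (assuming, as in step 3 below, that it is saved as /home/file1):

cat > /home/file1 << 'EOF'
hello world hello hadoop
abc hadoop aabb hello word
count test hdfs mapreduce
EOF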

2. Use the hdfs commands to create a directory for the input file (the hdfs commands are basically the same as the Linux ones; you can check the official website for details): hadoop fs -mkdir /input/wordcount
Then create an output directory /output for Hadoop to store the job results later, as shown below.
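
A minimal sketch of the commands for this step (the -p flag makes hadoop fs -mkdir create any parent directories that do not exist yet):

hadoop fs -mkdir -p /input/wordcount   # directory for the input file on HDFS
hadoop fs -mkdir -p /output            # parent directory for the job output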

3. Then put the file into Hadoop's file system: hadoop fs -put /home/file1 /input/wordcount
Once it has been created, you can use ls to check that the file exists: hadoop fs -ls -R /
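
For example (assuming the file from step 1 was uploaded to /home/file1, as above):

hadoop fs -put /home/file1 /input/wordcount   # copy the local file into HDFS
hadoop fs -ls -R /                            # list HDFS recursively to confirm the file is there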

4. Then go into Hadoop's share/hadoop/mapreduce directory, where there is a hadoop-mapreduce-examples-3.1.2.jar.
By running hadoop jar hadoop-mapreduce-examples-3.1.2.jar you can see which programs this official examples jar can execute.
The output is as follows:
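
(The original screenshot is not reproduced here; on Hadoop 3.1.2 the listing looks roughly like this, trimmed to a few entries:)

An example program must be given as the first argument.
Valid program names are:
  grep: A map/reduce program that counts the matches of a regex in the input.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  wordcount: A map/reduce program that counts the words in the input files.
  ...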

You can see a lot of built-in programs; here we use wordcount.
Execute the command:

hadoop jar hadoop-mapreduce-examples-3.1.2.jar wordcount /input/wordcount /output/wordcount

Of the last two parameters, the first is the input path of the file, i.e. the HDFS path we created earlier, and the second is the output path for the results.
If the output path does not exist, Hadoop will create it by itself.
5. The map phase runs first, then the reduce phase; this can be understood as a divide-and-conquer step. Map processes pieces of the file (possibly on multiple machines) to produce intermediate results, and then reduce summarizes (aggregates) those results.
Note that map is executed before reduce is executed.
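
As a rough local analogy only (this is not how Hadoop actually runs the job, just a way to picture map then reduce), a plain shell pipeline produces the same kind of result on the sample file:

tr -s ' ' '\n' < /home/file1 | sort | uniq -c
# tr       -> "map": emit one word per line
# sort     -> "shuffle": group identical words together
# uniq -c  -> "reduce": count each group of identical words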

6. Go to the output directory to view the result. There will be a few files in /output/wordcount; the one whose name starts with part is the actual output. You can view it with hadoop fs -cat <output file path>, for example:
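
A sketch of what this looks like (the result file is typically named part-r-00000; with the sample input above the counts should come out roughly as follows):

hadoop fs -ls /output/wordcount                  # lists _SUCCESS and the part file
hadoop fs -cat /output/wordcount/part-r-00000

aabb    1
abc     1
count   1
hadoop  2
hdfs    1
hello   3
mapreduce       1
test    1
word    1
world   1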

To sum up:

Although there don't seem to be many steps and the content is relatively simple, there are still quite a few pitfalls. Points to note:
1. For pseudo-distributed Hadoop, the hostname must be set up and kept consistent with the configuration files. If that doesn't work, just specify 127.0.0.1 directly (that's how I solved it, anyway).
2. Configure the YARN memory reasonably. If it is too small, the job will get stuck at the "running job" stage or at map 0%. In that case set the memory size in yarn-site.xml according to the actual server memory; I set it to 2048 MB and it was fine (see the snippet after this list).
3. If you find the job stuck at some stage, remember to check the logs in the Hadoop installation directory. There are many log types, including NodeManager, ResourceManager, etc.; when execution fails, there will be corresponding logs and messages there that can help you find the problem.
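
A minimal yarn-site.xml sketch for point 2 (these are standard YARN property names; 2048 MB is simply the value mentioned above, so adjust it to your actual server memory):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>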

Origin blog.csdn.net/sc9018181134/article/details/99710553