Hadoop Big Data Development Foundations Series: Part 3, Basic Hadoop Operations

Chapter 3: Basic Hadoop Operations

Contents:

1. Checking basic information about the Hadoop cluster

    1.1 Querying cluster storage system information

    1.2 Querying cluster computing resource information

2. Uploading files to an HDFS directory

    2.1 Understanding the HDFS file system

    2.2 Mastering basic HDFS operations

    2.3 Completing the task

3. Running the first MapReduce task

    3.1 Understanding the official Hadoop example package

    3.2 Submitting a MapReduce task to the cluster

4. Managing multiple MapReduce tasks

    4.1 Querying MapReduce tasks

    4.2 Interrupting a MapReduce task

5. Summary

6. Exercises


Main background: count the number of logins for each user of a website. The overall task involves checking the Hadoop cluster's resources, storing files in the cluster, running a distributed computing task, and monitoring task execution.

1. Checking basic information about the Hadoop cluster

    Data storage in a Hadoop cluster is provided by HDFS. HDFS consists of a NameNode and multiple DataNodes, which together form a distributed file system. There are generally two ways to view HDFS file system information: the command line and a web browser.

    The computing resources of a Hadoop cluster are distributed across the nodes of the cluster and are coordinated by the ResourceManager together with the NodeManagers. In general, the cluster's computing resources can be queried by accessing the ResourceManager's monitoring service through a browser.

    1.1 Querying cluster storage system information

        (1) The HDFS monitoring service can be accessed through the NameNode web UI on port 50070. Through this interface we can see the running state of the file system.

        (2) HDFS information can also be queried from the command line. On one of the cluster's servers, enter the query command (an example follows the option list below):

 hdfs dfsadmin -report [-live] [-dead] [-decommissioning]

            -live: output basic information and related statistics for live (online) nodes

            -dead: output basic information and related statistics for dead (failed) nodes

            -decommissioning: output basic information and related statistics for nodes being decommissioned

        
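        For example, a minimal sketch of checking only the nodes that are currently online (run on any node where the HDFS client is configured):

# Report overall file system usage plus per-DataNode details, limited to live nodes
hdfs dfsadmin -report -live
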

    1.2 Querying cluster computing resource information

        (1) Computing resource information for the current cluster can easily be checked through the ResourceManager web UI, on port 8088:

        (2) Resource information for an individual node can be viewed through its NodeManager web UI, on port 8042:

        From the information shown above, you can get an initial picture of the cluster's computing resources: the available compute nodes, the total available CPU cores and memory, and the CPU and memory resources of each individual node.
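
        If you prefer the terminal over the browser, the YARN command-line client reports similar information. A minimal sketch, assuming the yarn client is configured on the node you are logged in to (the node ID is a placeholder to be taken from the list output):

# List all NodeManager nodes with their state, running containers, and memory/vcore usage
yarn node -list -all

# Show detailed resource information for one node (substitute a real node ID from the list above)
yarn node -status <node-id>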

2. Uploading files to an HDFS directory

    2.1 Understanding the HDFS file system

        Like Linux, HDFS is a file system in its own right, with a directory structure similar to the operating system directories we are familiar with. (You can also browse directory and file information through the web UI.)

    2.2 Mastering basic HDFS operations

        Basic operations on the HDFS file system can be performed with HDFS commands in a terminal. The "hdfs dfs" command covers most administrative operations on HDFS directories and files, including creating directories, uploading and downloading files, viewing file contents, and deleting files.

        (1) Create a new directory /user/dfstest

            Typing hdfs dfs by itself in the terminal prints the command's usage help:

        We then execute the command that creates the directory path:

        Note that by default this command can only create one directory level at a time, so we add -p to create any missing intermediate directories in the path:

hdfs dfs -mkdir -p /user/test/example

         Then we list the HDFS root directory to see what we have created:

hdfs dfs -ls [path]
# HDFS is a separate file system from Linux, so use this command to list HDFS directories.
# Note: if no path is given, the command defaults to /user/<username>. Since that directory
# does not exist yet, the command fails; specify the path explicitly, e.g. / for the root directory.
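
         Putting these commands together, a minimal sketch for preparing the working directory used in this task (directory name as given above) might be:

hdfs dfs -mkdir -p /user/dfstest   # create the working directory; -p also creates missing parent directories
hdfs dfs -ls /user                 # verify that dfstest now appears under /user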

         (2) Uploading and downloading files

            Next, create a new folder to use for the test.

            ① Uploading files

            Test task: use HDFS commands to upload the local file test1.txt from a cluster server node to the HDFS directory /user/dfstest.

            Commands:

hdfs dfs [-copyFromLocal [-f] [-p] [-l] <localsrc> <dst>]
# Copy a file from the local file system to HDFS. Main parameters: <localsrc> is the local path, <dst> is the destination path to copy to.
hdfs dfs [-moveFromLocal <localsrc> <dst>]
# Move a file from the local file system to HDFS. Main parameters: <localsrc> is the local path, <dst> is the destination path to move to.
hdfs dfs [-put [-f] [-p] [-l] <localsrc> <dst>]
# Upload a file from the local file system to HDFS. Main parameters: <localsrc> is the local path, <dst> is the destination path to upload to.

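            As a concrete sketch of the upload, assuming the local file sits at /opt/test1.txt (adjust paths and file names to your environment):

hdfs dfs -put /opt/test1.txt /user/dfstest                       # upload, keeping the original file name
hdfs dfs -copyFromLocal /opt/test1.txt /user/dfstest/test2.txt   # upload under a different name
hdfs dfs -ls /user/dfstest                                       # confirm that the files arrived
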
            ② Downloading files

            Similarly, we can download the files we need from HDFS.

            Commands:

hdfs dfs [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> <localdst>]
# Copy a file from HDFS to the local file system. Main parameters: <src> is the HDFS path, <localdst> is the local destination path.
hdfs dfs [-get [-p] [-ignoreCrc] [-crc] <src> <localdst>]
# Get the file at the specified HDFS path and save it to the local file system. Main parameters: <src> is the HDFS path, <localdst> is the local destination path.

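            And a matching download sketch, assuming the files uploaded above and a writable local /opt directory:

hdfs dfs -get /user/dfstest/test1.txt /opt/test1_copy.txt   # fetch a file and rename it locally
hdfs dfs -copyToLocal /user/dfstest/test2.txt /opt/         # copy a file to /opt, keeping its HDFS name
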
        (3) Viewing file contents

            To view the contents of a file in the HDFS file system:

            Commands:

hdfs dfs [-cat [-ignoreCrc] <src>]
# Print the contents of an HDFS file; <src> specifies the file path.
hdfs dfs [-tail [-f] <file>]
# Print the last 1024 bytes of an HDFS file; <file> specifies the file path.

        Example:
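
        A quick sketch using the test file from the upload step (hypothetical path; substitute your own file):

hdfs dfs -cat /user/dfstest/test1.txt    # print the whole file to the terminal
hdfs dfs -tail /user/dfstest/test1.txt   # print only the last 1024 bytes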

        (4) Deleting a file or directory

hdfs dfs [-rm [-f] [-r] [-skipTrash] <src>]
# Delete files on HDFS. Main parameters: -r deletes recursively, <src> specifies the path to delete.
hdfs dfs [-rmdir [--ignore-fail-on-non-empty] <dir>]
# Delete a directory. Main parameter: <dir> specifies the directory path; a non-empty directory cannot be deleted this way.

            Example:
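
            A cleanup sketch for the hypothetical test files and directories created above:

hdfs dfs -rm /user/dfstest/test2.txt   # delete a single file
hdfs dfs -rmdir /user/test/example     # remove a directory (only works if it is empty)
hdfs dfs -rm -r /user/test             # recursively remove a directory and everything under it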

    2.3 Completing the task

        (1) Transfer email_log.txt to the /testhadoop directory on the master node (local file system).

        (2) Upload the file to the /user/dftest directory in the HDFS file system.

        (3) In the web UI, click email_log.txt to check the file's information.

            We can see that the file is split into two blocks (the file is about 216 MB), and each block has three replicas stored on three different DataNodes.
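
        A minimal command sketch for steps (1) and (2), assuming email_log.txt has already been copied to /testhadoop on the master node's local disk:

hdfs dfs -mkdir -p /user/dftest                        # create the target HDFS directory
hdfs dfs -put /testhadoop/email_log.txt /user/dftest   # upload the log file into HDFS
hdfs dfs -ls /user/dftest                              # confirm the upload succeeded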

3. Running the first MapReduce task

    The requirement is to process the data file /user/root/email_log.txt in HDFS and count the number of logins for each user. Since each login is recorded as an email address, this is equivalent to counting how many times each email address appears, which can be further abstracted as counting the frequency of each word. Executing a packaged program on the Hadoop cluster, that is, submitting a MapReduce task, is usually done with the hadoop jar command.

    3.1 Understanding the official Hadoop example package

        In the cluster server's local directory "$HADOOP_HOME/share/hadoop/mapreduce/" you can find the example package "hadoop-mapreduce-examples-2.7.7.jar". This package bundles a number of common test programs, including the following:

multifilewc: counts the words in multiple input files

pi: estimates the value of pi using a quasi-Monte Carlo algorithm

randomtextwriter: generates 10 GB of random text data on each data node

wordcount: counts the frequency of each word in the input files

wordmean: computes the average length of the words in the input files

wordmedian: computes the median length of the words in the input files

wordstandarddeviation: computes the standard deviation of the word lengths in the input files

        In this task, we use wordcount on the email_log.txt data file to count the number of logins.

    3.2 Submitting a MapReduce task to the cluster

       (1) To submit a MapReduce task, the hadoop jar command is used as follows:

hadoop jar <jar> [mainClass] args

        The individual parameters are explained using the example below:

hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount /user/dftest/email_log.txt /user/dftest/output
# <jar> is the path of the jar package, [mainClass] is the class to run inside the packaged jar,
# and args specify the input file to read and the output directory for the results.

        (2) The run log is as follows:

        (3) Viewing the results:

        We can see that two new files were generated in the output directory: one is _SUCCESS, a marker file indicating that the task completed successfully; the other is part-r-00000, the result file produced when the task finished.

        Viewing the contents of part-r-00000:

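        The output can also be inspected from the command line; a hedged sketch using the output directory from the command above:

hdfs dfs -ls /user/dftest/output                              # should list _SUCCESS and part-r-00000
hdfs dfs -cat /user/dftest/output/part-r-00000 | head -n 20   # show the first 20 result lines
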
        ^_^ With that, the task is basically complete.

4. Managing multiple MapReduce tasks

    Hadoop is a multi-tasking system: multiple users can process multiple data sets with multiple jobs at the same time. For the many tasks submitted to a Hadoop cluster, how do we manage them? For example: how do we know which tasks are running on the cluster; whether a task succeeded or failed; how do we check the actual execution of a task; and if a task runs for too long, how do we interrupt it?

    4.1 Querying MapReduce tasks

        (1) In this example we use the pi class from the example package to estimate the value of π:

hadoop jar hadoop-mapreduce-examples-2.7.7.jar pi 10 100
# The two parameters are the number of Map tasks and the number of samples per Map task;
# the larger the values, the more accurate the computed result.

        (2) The execution log:

        (3) Viewing the computing resource usage of the MapReduce task we just submitted:

        ① On this page we can see the real-time usage of cluster resources (since execution has already finished, the values have returned to their initial state).

        ② The MapReduce task list displayed in the ResourceManager:

        ③ Viewing the detailed information of a task:

        (4) Submitting multiple tasks:

            Start two terminals connected to the cluster servers and submit two jobs one after the other: wordcount and the pi estimation.

            Observing the cluster's computing resource usage, we can see one job running and occupying most of the computing resources, while the other job is in the waiting state (waiting for computing resources to be allocated to it; once its resource requirements are met, it will start running).
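
            Besides the web UI, submitted applications can also be queried from the command line; a minimal sketch, assuming the yarn client is configured (the application ID is a placeholder taken from the list output):

yarn application -list                      # applications currently submitted, accepted, or running
yarn application -list -appStates ALL       # include finished, failed, and killed applications
yarn application -status <application-id>   # detailed status of one application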

    4.2 Interrupting a MapReduce task

        A MapReduce task that has already been submitted sometimes needs to be interrupted in special situations, for example when the program is found to be faulty, or when a task runs for too long and occupies a large amount of computing resources.

        In the web interface shown above, open the task's Application page, select the "Kill Application" option, and confirm to interrupt the task. Refreshing the page again, we can see that the task has been terminated and the waiting task starts to execute.
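
        The same interruption can be performed from the command line; a minimal sketch (substitute the real application ID shown by the list command):

yarn application -kill <application-id>   # terminate the specified application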

5. Summary

    This chapter covered the basics of operating Hadoop. Through a practical task and several examples, we gained an initial understanding of the HDFS file system, the cluster's computing resources, and how to submit and manage MapReduce tasks.

6. Exercises

    (1) When submitting a MapReduce task with the hadoop jar command, if the specified output directory already exists, the result will be: ( C )

        A. The original directory is overwritten    B. A new directory is created automatically    C. An error is reported and the task is terminated    D. None of the above

 

Quick recap of the HDFS commands used in this chapter:

3. Uploading and downloading files

    There are three ways to upload a file:

    (1) Copy a file from the local file system into HDFS:

        hdfs dfs -mkdir /user/dfstest
        hdfs dfs -copyFromLocal /opt/email_log.txt /user/dfstest
        (usage: hdfs dfs -copyFromLocal <local path> <HDFS path>)

    (2) Move a file from the local file system into HDFS:

        hdfs dfs -moveFromLocal /opt/a.txt /user/dfstest

    (3) Upload a file from the local file system into HDFS:

        hdfs dfs -put /opt/c.txt /user/dfstest

    Note: all three upload commands can also rename the file in the target HDFS directory, for example: hdfs dfs -put /opt/c.txt /user/dfstest/m.txt

    There are two ways to download a file:

    (1) Copy a file from HDFS to the local file system:

        hdfs dfs -copyToLocal /user/dfstest/m.txt /opt/
        (usage: hdfs dfs -copyToLocal <HDFS path> <local path>)

    (2) Get the file at the specified HDFS path and save it to the local file system:

        hdfs dfs -get /user/dfstest/n.txt /opt

4. Viewing file contents

    (1) View the contents of an HDFS file:

        hdfs dfs -cat /user/dfstest/a.txt

    (2) Output the last 1024 bytes of an HDFS file:

        hdfs dfs -tail /user/dfstest/a.txt

5. Deleting files and directories

    (1) Delete files on HDFS (the -r flag deletes recursively):

        hdfs dfs -rm <file path>
        hdfs dfs -rm -r <directory path>

    (2) Delete a directory on HDFS (a non-empty directory cannot be deleted this way):

        hdfs dfs -rmdir <directory path>

 

The next chapter will be an introduction to MapReduce programming ^_^.
