Summary of Hadoop knowledge points

1. Briefly describe your understanding of safe mode (SafeMode) in a Hadoop cluster.

         While the cluster is in safe mode it cannot perform important operations (write operations); the cluster is effectively read-only. Strictly speaking, safe mode only guarantees access to HDFS metadata, not to file contents. After the cluster starts, the NameNode exits safe mode automatically. If the cluster is still in safe mode and you need to perform write operations, you must leave safe mode first.

         Check the safe mode status: bin/hdfs dfsadmin -safemode get

         Enter the safe mode state: bin/hdfs dfsadmin -safemode enter

         Leave the safe mode state: bin/hdfs dfsadmin -safemode leave

         Wait for the safe mode status: bin/hdfs dfsadmin -safemode wait

         For a newly formatted HDFS cluster, the NameNode does not enter safe mode after startup, because there is no block information yet.
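         A hedged sketch of doing the same check from Java, assuming the Hadoop 2.x HDFS client API (DistributedFileSystem.setSafeMode); in practice the dfsadmin commands above are the normal way:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.hdfs.DistributedFileSystem;
         import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

         public class SafeModeCheck {
             public static void main(String[] args) throws Exception {
                 Configuration conf = new Configuration(); // assumes fs.defaultFS points at the cluster's NameNode
                 DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

                 // SAFEMODE_GET only queries the current state, it does not change it
                 boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
                 System.out.println("NameNode in safe mode: " + inSafeMode);

                 if (inSafeMode) {
                     // equivalent to: bin/hdfs dfsadmin -safemode leave
                     dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);
                 }
             }
         }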

 

2. How do you configure the whitelist and blacklist in a Hadoop cluster, and what are they used for?

         Hosts on the whitelist are allowed to connect to the NameNode; hosts that are not on the whitelist are refused and retired from the cluster

         Create a white.hosts file: the hostnames listed in it make up the whitelist

         Point the dfs.hosts property in hdfs-site.xml at the white.hosts file and distribute the configuration to the cluster

         Hosts on the blacklist are forced to decommission (retire)

         Create a black.hosts file: the hostnames listed in it make up the blacklist

         Point the dfs.hosts.exclude property in hdfs-site.xml at the black.hosts file and distribute the configuration to the cluster

        

3. Is it possible to run Hadoop on Windows?

         Not in practice: production Hadoop clusters run on Linux/Unix. Hadoop can only be run on Windows for development or testing with extra tooling (e.g., Cygwin).

 

4. Briefly describe the basic phases a MapReduce job goes through when processing tasks.

         The concurrent instances of MapTask in the first stage run completely in parallel and are independent of each other

         The concurrent instances of ReduceTask in the second stage are not related to each other, but their data depends on the output of all concurrent instances of MapTask in the previous stage

         The MapReduce programming model contains only one Map phase and one Reduce phase; if the business logic is more complex, it can only be expressed by chaining multiple MapReduce jobs in series.

 

5. Briefly describe how TextInputFormat splits input files.

         The default split size is 128 MB (the HDFS block size). For each file, if the remaining bytes are no more than 1.1 times the split size (140.8 MB), they are kept as a single split; otherwise a 128 MB split is cut off and the check is repeated on the remainder.
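         A simplified sketch of that decision (a hypothetical helper, not the actual FileInputFormat source; the 1.1 factor corresponds to FileInputFormat's SPLIT_SLOP constant):

         // Illustration only: cut a file of fileLength bytes into split lengths
         class SplitSketch {
             static final double SPLIT_SLOP = 1.1; // same value as FileInputFormat's SPLIT_SLOP

             static java.util.List<Long> computeSplitLengths(long fileLength, long splitSize) {
                 java.util.List<Long> splits = new java.util.ArrayList<>();
                 long remaining = fileLength;
                 // only cut a full split while the remainder is more than 1.1 x splitSize
                 while ((double) remaining / splitSize > SPLIT_SLOP) {
                     splits.add(splitSize);
                     remaining -= splitSize;
                 }
                 if (remaining > 0) {
                     splits.add(remaining); // the last split may be up to 1.1 x splitSize
                 }
                 return splits;
             }
         }

         With a 128 MB split size, a 130 MB file stays as one split (130 MB < 140.8 MB), while a 200 MB file becomes two splits of 128 MB and 72 MB.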

        

6. What can you do if the NameNode has no data (its metadata is lost)?

         Copy the data in the Secondary NameNode to the directory where the NameNode stores the data

         Use the -importCheckpoint option to start the NameNode daemon to copy the data in the SecondaryNameNode to the NameNode directory

 

7. How do you achieve password-free login between servers (the simple way), and what kind of encryption does SSH use?

         Generate public and private keys: ssh-keygen -t rsa

         Copy the public key to the target machine (for example with ssh-copy-id <target-host>) to enable password-free login

         SSH uses asymmetric encryption

 

8. Briefly describe the scenarios MapReduce is not suited for, i.e., its shortcomings.

         Disadvantages of MapReduce: not good at real-time computation, not good at streaming computation, not good at DAG (directed acyclic graph) computation

 

9. What are the basic data types of MapReduce?

         BooleanWritable, ByteWritable, IntWritable, FloatWritable, LongWritable, DoubleWritable, Text, MapWritable, ArrayWritable

 

10. What are the components of YARN and what do they do? What are the three main schedulers, and which one is Hadoop's default?

         YARN is composed of the ResourceManager, NodeManager, ApplicationMaster, and Container components;

         ResourceManager: Process client requests, monitor NodeManager, start or monitor ApplicationMaster, resource allocation and scheduling

         NodeManager: Manage resources on a single node, process commands from ResourceManager, and process commands from ApplicationMaster

         ApplicationMaster: Responsible for data segmentation, applying for resources for applications and assigning them to internal tasks, task monitoring and fault tolerance

         Container: Container is a resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network.

         The three main schedulers are the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler

         Apache Hadoop's default resource scheduler is the Capacity Scheduler

 

11. If you want to start all cluster nodes together from the NameNode node (group start), what needs to be configured?

         Password-free login between servers must be implemented

         Modify the etc/hadoop/slaves file under the hadoop-2.7.2 directory (list the hostnames of all worker nodes in it)

        

12. Briefly describe the role of the ring (circular) buffer in the shuffle process.

         The key/value pairs emitted by the map() method are collected by the OutputCollector, which obtains a partition number via the Partitioner's getPartition() method and then writes the record into the ring buffer. The ring buffer is 100 MB by default; when it is 80% full (80 MB), a spill (overflow write) to disk starts. If more map output arrives while the spill is in progress, it is written into the remaining 20% in the reverse direction. During the spill, records are first partitioned and then sorted by key within each partition. Finally the MapTask's spill files are merge-sorted into one file on the local disk. Each ReduceTask then copies its partition from all MapTasks, merge-sorts the copied data, and passes one group of records at a time to the reduce() function.
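         As an illustration of where the partition number comes from, here is a minimal custom Partitioner sketch (a hypothetical class for illustration only; by default HashPartitioner is used, which takes the key's hashCode modulo the number of ReduceTasks):

         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Partitioner;

         // Hypothetical example: words starting with a-m go to partition 0, all other words to partition 1
         public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
             @Override
             public int getPartition(Text key, IntWritable value, int numPartitions) {
                 if (numPartitions < 2) {
                     return 0; // with a single ReduceTask every record lands in partition 0
                 }
                 String word = key.toString();
                 char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
                 return (first <= 'm') ? 0 : 1;
             }
         }

         It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).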

 

13. During the execution of MapReduce, what main work happens before the ReduceTask runs?

         Compute the number of MapTasks and split the input files, run the map logic and write its output into the ring buffer, partition the data and sort it within each partition (quick sort), spill to disk files (partitioned and sorted), and finally merge-sort the spill files

 

14. What are three main properties of hdfs-site.xml?

         Specify the number of HDFS replicas: dfs.replication

         Specify the SecondaryNameNode HTTP address: dfs.namenode.secondary.http-address

         Specify the DataNode data-transfer address: dfs.datanode.address

 

15. What should you pay attention to when setting up Hadoop in fully distributed mode?

         Plan the cluster layout: distribute the NameNode, DataNodes, ResourceManager, NodeManagers, and HistoryServer across different servers

         Set the address of the server hosting the NameNode and the storage directory for temporary files (core-site.xml)

         Configure the replication factor (hdfs-site.xml)

         Configure mapred-site.xml and yarn-site.xml

         Keep the configuration on the other nodes in the cluster consistent with hadoop01 (distribute the configuration files)

        

16. What are the three modes in which a Hadoop cluster can run?

Local (standalone) mode, pseudo-distributed mode, and fully distributed mode

 

17. Summarize the role and significance of the Combiner in one sentence, and explain the precondition for using it.

         The Combiner performs a local aggregation of each MapTask's output to reduce the amount of data transferred over the network; the precondition is that using it must not change the final result (it works for operations such as summation, but not, for example, for computing an average directly).
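         A minimal sketch of enabling it, reusing the WordCount classes from question 19 below; because WcReducer only sums counts, it can safely double as the Combiner:

         // Excerpt from WcDriver.main() in question 19, with the Combiner added
         Job job = Job.getInstance(new Configuration());
         job.setJarByClass(WcDriver.class);
         job.setMapperClass(WcMapper.class);
         job.setCombinerClass(WcReducer.class); // local aggregation on the map side
         job.setReducerClass(WcReducer.class);
         // ... remaining output-type and path settings are unchanged ...

         Each MapTask then pre-sums its own (word, 1) pairs before the shuffle, so far fewer records travel over the network to the ReduceTasks.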

 

18. Write a shell script that finds the extreme values of its input arguments and prints the maximum and minimum (handwritten level).

         #!/bin/bash
         # Print the maximum and minimum of the values passed as arguments
         min=$1
         max=$1
         for i in "$@"
         do
                  if [ $min -gt $i ]
                  then
                           min=$i
                  fi
                  if [ $max -lt $i ]
                  then
                           max=$i
                  fi
         done
         echo "The maximum value is" $max
         echo "The minimum value is" $min

        

19. Write the most basic WordCount MapReduce program (handwritten level; you can refer to the code sample from class).

         // Each class below lives in its own .java file; they share these imports:
         // import java.io.IOException;
         // import org.apache.hadoop.conf.Configuration;
         // import org.apache.hadoop.fs.Path;
         // import org.apache.hadoop.io.IntWritable;
         // import org.apache.hadoop.io.LongWritable;
         // import org.apache.hadoop.io.Text;
         // import org.apache.hadoop.mapreduce.Job;
         // import org.apache.hadoop.mapreduce.Mapper;
         // import org.apache.hadoop.mapreduce.Reducer;
         // import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
         // import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

         public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
             private Text keyText = new Text();
             private IntWritable one = new IntWritable(1);

             @Override
             protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                 // Split each input line into words and emit (word, 1) for every word
                 String line = value.toString();
                 String[] fields = line.split(" ");
                 for (String field : fields) {
                     keyText.set(field);
                     context.write(keyText, one);
                 }
             }
         }

         public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
             private IntWritable total = new IntWritable();

             @Override
             protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                 // Sum all the counts for the same word
                 int sum = 0;
                 for (IntWritable value : values) {
                     sum += value.get();
                 }
                 total.set(sum);
                 context.write(key, total);
             }
         }

         public class WcDriver {
             public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
                 Job job = Job.getInstance(new Configuration());
                 job.setJarByClass(WcDriver.class);
                 job.setMapperClass(WcMapper.class);
                 job.setReducerClass(WcReducer.class);
                 job.setMapOutputKeyClass(Text.class);
                 job.setMapOutputValueClass(IntWritable.class);
                 job.setOutputKeyClass(Text.class);
                 job.setOutputValueClass(IntWritable.class);
                 FileInputFormat.setInputPaths(job, new Path(args[0]));
                 FileOutputFormat.setOutputPath(job, new Path(args[1]));
                 boolean b = job.waitForCompletion(true);
                 System.exit(b ? 0 : 1);
             }
         }

 

20. List commonly used HDFS commands (at least 10) and explain what each does.

         Local file → HDFS:

    -put: upload local data to hdfs

    -copyFromLocal: copy local file data to hdfs

    -moveFromLocal: move local file data to hdfs, local data will be deleted after success

    -appendToFile: Append a file to the end of an existing file

         Between HDFS and HDFS:

    -ls: View the hdfs file directory

    -mkdir: create a directory on HDFS

    -rm: delete files or folders

    -rmr: delete recursively

    -cp: copy files from one directory to another

    -mv: Move files in the HDFS directory

    -chown: modify the user permissions of the file

    -chmod: modify the read and write permissions of the file

    -du -h: show the space used by a directory (human-readable)

    -df -h: show the capacity and free space of the file system (human-readable)

    -cat: view files

         HDFS → Local:

    -get: download files from hdfs to local

    -getmerge: merge files in the hdfs directory to the local

-copyToLocal: copy files from hdfs to local

 

21. List 10 commonly used Linux commands and explain their functions.

         View current network IP: ifconfig

         Modify the IP address: vim /etc/sysconfig/network-scripts/ifcfg-eth0

         Start the network service: service network start

         Turn off the self-start of the iptables service: chkconfig iptables off

         Create an empty file: touch file name

         Add a new user: useradd username

         Display the username of the logged in user: who am i

         Change permissions: chmod 421 file or directory

         View all processes in the system: ps -aux

         Query the installed rpm packages: rpm -qa | grep <package-name>
