1. Briefly describe your understanding of the safe mode (SafeMode) of a Hadoop cluster?
When the cluster is in safe mode it cannot perform write operations; it is effectively in a read-only state. Strictly speaking, safe mode only guarantees access to HDFS metadata, not access to file contents. After startup the cluster exits safe mode automatically; if the cluster is stuck in safe mode, you must leave safe mode before write operations can complete.
Check the safe mode status: bin/hdfs dfsadmin -safemode get
Enter the safe mode state: bin/hdfs dfsadmin -safemode enter
Leave the safe mode state: bin/hdfs dfsadmin -safemode leave
Wait for the safe mode status: bin/hdfs dfsadmin -safemode wait
For a newly formatted HDFS cluster, the NameNode will not enter safe mode after startup, because there is no block information yet
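When the NameNode leaves safe mode automatically is governed by a block-report threshold. A sketch of the relevant hdfs-site.xml property (the value shown is the usual default):

```xml
<!-- Fraction of blocks that must meet their minimum replication
     before the NameNode leaves safe mode automatically. -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999</value>
</property>
```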
2. How do you configure the whitelist and blacklist of a Hadoop cluster? What are their roles?
Host nodes on the whitelist are allowed to connect to the NameNode; host nodes not on the whitelist are forced out of the cluster
Create a white.hosts file: the hostnames added to this file form the whitelist
Reference the white.hosts file through the dfs.hosts property in the hdfs-site.xml configuration file and distribute the file
Hosts on the blacklist are forced to exit the cluster (decommissioned)
Create a black.hosts file: the hostnames added to this file form the blacklist
Reference the black.hosts file through the dfs.hosts.exclude property in the hdfs-site.xml configuration file and distribute the file
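As a sketch, the two files are wired up through the dfs.hosts and dfs.hosts.exclude properties (the file paths below are illustrative); after distributing the files, run `hdfs dfsadmin -refreshNodes` so the NameNode re-reads them:

```xml
<property>
  <name>dfs.hosts</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/white.hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/opt/module/hadoop-2.7.2/etc/hadoop/black.hosts</value>
</property>
```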
3. Is it possible to run Hadoop on Windows?
Not in practice. Hadoop is designed for Linux; running it on Windows requires extra tooling (such as Cygwin or winutils) and is generally limited to development and testing
4. Briefly describe the basic phases a MapReduce job goes through when processing a task?
In the first phase, the concurrent MapTask instances run fully in parallel and are independent of each other
In the second phase, the concurrent ReduceTask instances are also independent of each other, but their input depends on the output of all MapTask instances from the previous phase
The MapReduce programming model can contain only one Map phase and one Reduce phase; if the user's business logic is more complex, multiple MapReduce jobs must be run in series
5. Briefly describe how TextInputFormat performs file splitting?
After getting the file, the default split size equals the block size, 128 MB. When slicing, Hadoop checks whether the remaining part of the file is larger than 1.1 times the split size (140.8 MB): if it is not, the remainder is kept as a single split; if it is, a 128 MB split is cut off and the check is repeated on what is left
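The 1.1x rule above can be sketched as a small shell function (sizes in MB; assumes the default 128 MB block size and Hadoop's SPLIT_SLOP factor of 1.1):

```shell
# Mimic FileInputFormat's split loop: keep cutting 128 MB splits while
# the remaining size exceeds 1.1 * blockSize (140.8 MB); otherwise the
# remainder becomes the final split. Integer math in tenths avoids bc.
split_file() {
  local remaining=$1 block=128 splits=""
  while [ $((remaining * 10)) -gt $((block * 11)) ]; do
    splits="${splits}${block} "
    remaining=$((remaining - block))
  done
  echo "${splits}${remaining}"
}
split_file 130   # prints "130"  (130 MB <= 140.8 MB, so no split)
split_file 300   # prints "128 128 44"
```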
6. What do you do if the NameNode has lost its data?
Copy the data in the Secondary NameNode to the directory where the NameNode stores the data
Use the -importCheckpoint option to start the NameNode daemon to copy the data in the SecondaryNameNode to the NameNode directory
7. How do you set up password-free login between servers (the convenient way), and what kind of encryption does SSH use?
Generate public and private keys: ssh-keygen -t rsa
Copy the public key to the target machine (e.g. with ssh-copy-id) to enable password-free login
SSH uses asymmetric encryption
8. Briefly describe the scenarios for which MapReduce is not suitable; in other words, its shortcomings?
Disadvantages of MapReduce: not good at real-time computing, not good at streaming computing, not good at DAG (directed acyclic graph) computing
9. What are the basic data types of MapReduce?
BooleanWritable, ByteWritable, IntWritable, FloatWritable, LongWritable, DoubleWritable, Text, MapWritable, ArrayWritable
10. What are the components of YARN and what are their functions? What are the three main schedulers, and which one is the Hadoop default?
It is composed of components such as ResourceManager, NodeManager, ApplicationMaster and Container;
ResourceManager: Process client requests, monitor NodeManager, start or monitor ApplicationMaster, resource allocation and scheduling
NodeManager: Manage resources on a single node, process commands from ResourceManager, and process commands from ApplicationMaster
ApplicationMaster: Responsible for data segmentation, applying for resources for applications and assigning them to internal tasks, task monitoring and fault tolerance
Container: Container is a resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network.
FIFO, Capacity Scheduler (capacity scheduler) and Fair Scheduler (fair scheduler)
The default resource scheduler of Hadoop is Capacity Scheduler (capacity scheduler)
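The scheduler is selected in yarn-site.xml; a sketch showing the stock Hadoop default:

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```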
11. If you need to configure which nodes belong to the NameNode's cluster, what do you need to do?
Password-free login between servers must be implemented
Modify the etc/hadoop/slaves file under the hadoop-2.7.2 folder (list the hostnames of all worker nodes)
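For example, a slaves file for a three-node cluster simply lists one worker hostname per line (hostnames here are illustrative):

```
hadoop01
hadoop02
hadoop03
```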
12. Briefly describe the function of the circular (ring) buffer in the Shuffle process?
The key/value pairs output by the map() method are collected by the OutputCollector, which obtains a partition number through the Partitioner, and are then written into the circular buffer. By default the buffer is 100 MB; when it is 80% full (80 MB), the spill process starts. If more map output arrives while the spill is in progress, it is written into the remaining 20% of the buffer in the opposite direction. During the spill the data is partitioned by key and sorted; finally the MapTask's spill files are merge-sorted and written to local disk. Each ReduceTask then copies its partition's data from all the MapTasks, merge-sorts it, and passes one group of records at a time to the reduce() function
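The buffer size and spill threshold described above map to two mapred-site.xml properties (the values shown are the defaults):

```xml
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>   <!-- circular buffer size, in MB -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>  <!-- start spilling when 80% full -->
</property>
```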
13. During the execution of a MapReduce job, what are the main steps before the ReduceTask executes?
Compute the number of MapTasks, split the input file, write data into the circular buffer, partition and quick-sort it, spill to files (partitioned and sorted), then merge-sort the spill files
14. What are three main properties of hdfs-site.xml?
Specify the number of HDFS replicas: dfs.replication
Specify the SecondaryNameNode HTTP address: dfs.namenode.secondary.http-address
Specify the DataNode address: dfs.datanode.address
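Put together, an hdfs-site.xml using these three properties might look like this (hostname and port values are illustrative; the DataNode port shown is the Hadoop 2.x default):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop03:50090</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
</property>
```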
15. What are the precautions for Hadoop fully distributed mode?
Plan the cluster: spread the NameNode, DataNodes, ResourceManager, NodeManagers, and history server across different servers
Set the server address where the namenode node is located and the storage directory of temporary files (core-site.xml)
Configure the number of replicas (hdfs-site.xml)
Configure mapred-site.xml and yarn-site.xml
Keep the configuration on the other nodes in the cluster consistent with hadoop01
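A sketch of the core-site.xml part of that checklist (the hostname, port, and path are illustrative):

```xml
<!-- NameNode address -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop01:9000</value>
</property>
<!-- storage directory for temporary files -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
```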
16. What are the 3 modes in which a Hadoop cluster can run?
Local operation mode, pseudo-distributed operation mode and fully distributed operation mode
17. Summarize the role and significance of the Combiner in one sentence, and explain the prerequisite for using it?
The Combiner locally aggregates the output of each MapTask to reduce the amount of data transferred over the network; the prerequisite is that using it must not change the final business result (it suits operations like summing, but not, say, averaging), and its output key/value types must match the Reducer's input types (e.g. job.setCombinerClass(WcReducer.class))
18. Write a shell script that finds the extreme values of the input data and outputs the maximum and minimum (to handwriting level)
#!/bin/bash
# usage: ./extremes.sh 3 9 1 7
min=$1
max=$1
for i in "$@"
do
    if [ "$min" -gt "$i" ]
    then
        min=$i
    fi
    if [ "$max" -lt "$i" ]
    then
        max=$i
    fi
done
echo "The maximum value is $max"
echo "The minimum value is $min"
19. Write the most basic WordCount MapReduce job (to handwriting level; you may refer to the code sample from class)
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text keyText = new Text();
private IntWritable one = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] fields = line.split(" ");
for (String field : fields) {
keyText.set(field);
context.write(keyText, one);
}
}
}
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable total = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
total.set(sum);
context.write(key, total);
}
}
public class WcDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(WcDriver.class);
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
20. List commonly used HDFS commands (at least 10) and explain what each command does
Local file → HDFS:
-put: upload local data to hdfs
-copyFromLocal: copy local file data to hdfs
-moveFromLocal: move local file data to hdfs, local data will be deleted after success
-appendToFile: Append a file to the end of an existing file
Between HDFS and HDFS:
-ls: View the hdfs file directory
-mkdir: create a directory on HDFS
-rm: delete files or folders
-rm -r: delete recursively (the older -rmr form is deprecated)
-cp: copy files from one directory to another
-mv: Move files in the HDFS directory
-chown: change the owner (and group) of a file
-chmod: modify the read/write/execute permissions of a file
-du -h: show the disk space used by files and folders in human-readable form
-df -h: show the free and used space of the file system
-cat: view files
HDFS → Local:
-get: download files from hdfs to local
-getmerge: merge files in the hdfs directory to the local
-copyToLocal: copy files from hdfs to local
21. List 10 commonly used Linux commands and explain their functions
View current network IP: ifconfig
Modify the IP address: vim /etc/sysconfig/network-scripts/ifcfg-eth0
Restart the network service: service network restart
Turn off the self-start of the iptables service: chkconfig iptables off
Create an empty file: touch file name
Add a new user: useradd username
Display the username of the logged in user: who am i
Change permissions: chmod 421 file or directory
View all processes in the system: ps aux
Query installed rpm packages: rpm -qa | grep package-name
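A few of the file-related commands above in action, in a throwaway directory (the paths are illustrative):

```shell
tmp=$(mktemp -d)             # scratch directory for the demo
touch "$tmp/demo.txt"        # create an empty file
chmod 644 "$tmp/demo.txt"    # set permissions to rw-r--r--
ls -l "$tmp/demo.txt"        # list the file with its mode and owner
```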