South China Agricultural University, Spring 2021 "Hadoop Big Data Processing Technology" Final Review Paper

Foreword

I collected practice papers from the Internet, papers from previous years, and the listed exam topics, and combined them into one review paper, hoping to help you review more effectively. When I first shared it with classmates, some said it was too detailed. In my view, that usually means the review was not thorough enough: rote memorization makes it easy to get lost in the material without being able to tell what matters most, and it just turns the brain into a muddle. What we should do instead is use practice papers to help the brain build a network of connected knowledge.
In the actual final exam, the paper turned out to be even more detailed than this one. That does not mean it was difficult; rather, many topics that take up only a few words in the syllabus, and conclusions I thought I already understood, were expanded into full questions.

1. Multiple Choice Questions

1. Which of the following programs is responsible for HDFS data storage?
A. NameNode B. Jobtracker
C. Datanode D. secondaryNameNode
2. How many replicas of each block does HDFS save by default?
A. 3 copies B. 2 copies
C. 1 copy D. Not sure
3. Which of the following processes is responsible for MapReduce task scheduling?
A. NameNode B. Jobtracker
C. TaskTracker D. secondaryNameNode
4. Which of the following is not a Hadoop (YARN) scheduler strategy?
A. FIFO scheduler B. Capacity scheduler
C. Fair scheduler D. Priority scheduler
5. Which of the following is correct when the client uploads files?
A. The data is passed to the DataNode through the NameNode
B. The client divides the file into blocks and uploads them sequentially
C. The client only uploads the data to one DataNode, and then the NameNode is responsible for copying the blocks
D. None of the above is correct
6. In the experimental cluster, when using the jps command to check the processes on the master node, which of the following sets of processes indicates that the Hadoop master node has started successfully?
A. NameNode, DataNode, NodeManager
B. NameNode, DataNode, SecondaryNameNode
C. NameNode, DataNode, HMaster
D. NameNode, ResourceManager, SecondaryNameNode
7. In the MapReduce programming model, if no special settings are made for the key and value, which of the following operations is not well suited to MapReduce?
A. Max B. Min
C. Count D. Average
8. In the MapReduce programming model, which interface must the key of the key-value pair <key, value> implement?
A. WritableComparable B. Comparable
C. Writable D. LongWritable
9. HBase is a distributed column-oriented storage system in which records are grouped and stored by:
A. Column family B. Column
C. Row D. Not sure
10. Which of the following must be included in the composition of an HBase Region?
A. StoreFile B. MemStore
C. HFile D. MetaStore
11. When designing a table for the distributed data warehouse Hive, which operation is generally performed on continuous fields in the table to make sampling more efficient?
A. Bucket B. Partition
C. Index D. Table
12. When a client queries the HBase database for the first time, which table needs to be looked up first?
A. .META. B. –ROOT-
C. User table D. Info table
13. HDFS stores a 75MB gzip file and a 75MB LZO file (with index), and the client has set the block size to 64MB. When a MapReduce job reads these files, the input split sizes are:
A. One map reads 64MB, and the other map reads 11MB.
B. Both are read as a single 75MB split
C. When reading the gzip file, the input split is 75MB; when reading the LZO file, one map reads 64MB and the other reads 11MB
D. When reading the LZO file, the input split is 75MB; when reading the gzip file, one map reads 64MB and the other reads 11MB
14. Which of the following statements about the SecondaryNameNode is correct?
A. Its purpose is to help the NameNode merge edit logs, reducing the NameNode's burden and the load time during a cold start
B. It has no memory requirements
C. It is a hot standby for the NameNode
D. The SecondaryNameNode should be deployed on the same node as the NameNode
15. Which of the following hadoop shell commands can be used to upload a local file into the cluster?
A. hadoop fs -put
B. hadoop fs -push /
C. hadoop fs -put /
D. hadoop -push /
16. If you want to modify the number of block replicas (backups) in the cluster, which of the following configuration files should be modified?
A. mapred-site.xml
B. core-site.xml
C. hdfs-site.xml
D. hadoop-env.sh
17. Which of the following is not an HDFS daemon process?
A. SecondaryNameNode
B. NameNode
C. MrAppMaster/YarnChild
D. DataNode
18. Big data is generally considered to start from at least which of the following storage levels?
A. EB B. PB C. TB D. ZB
19. Which of the following descriptions of DataNodes in an HDFS cluster is incorrect?
A. The data blocks stored on one DataNode may contain the same data
B. DataNodes store the data blocks uploaded by clients
C. DataNodes can communicate with each other
D. DataNodes respond to all read and write requests from clients and provide support for clients to store and read data
20. Which of the following operations in the MapReduce shuffle process is performed last?
A. Sorting B. Merging C. Partitioning D. Spilling to disk
21. Which of the following descriptions of HDFS is correct?
A. The NameNode's on-disk metadata does not store block location information
B. DataNodes maintain communication with the NameNode through long-lived connections
C. An HDFS cluster supports random reads and writes of data
D. If the NameNode goes down, the SecondaryNameNode takes over so that the cluster can continue working
22. The size of a gzip file is 300MB, and the block size set by the client is 128MB. How many blocks does it occupy?
A. 1 B. 2 C. 3 D. 4
23. Which of the following is not an advantage of Hadoop clusters?
A. High fault tolerance
B. High cost
C. High reliability
D. High scalability
24. Which of the following is not one of the three installation modes of Hadoop?
A. Two-distributed mode
B. Fully distributed mode
C. Pseudo-distributed mode
D. Stand-alone mode
25. The storage core of HBase is:
A. HRegion B. HStore C. StoreFile D. MemStore
[The multiple-choice questions in the actual exam were fairly standard.]

2. True/False Questions

  1. Both Hadoop and HBase support random reading and writing of data.
  2. The NameNode is responsible for managing the metadata information. For each read and write request from the client, it will read or write metadata information from the disk and feed it back to the client.
  3. The input split of MapReduce must be a block.
  4. MapReduce is suitable for online processing of massive data above the PB level.
  5. MapReduce's sort-based approach in the shuffle stage gathers data with the same key together.
  6. During MapReduce calculation, the same key will be sent to the same reduce task by default for processing.
  7. HBase does not need to occupy storage space for empty (NULL) columns.
  8. An HBase table can have columns without having column families.
  9. HDFS is not only suitable for the storage of very large data sets, but also suitable for the storage of small data sets.
  10. HDFS accesses data in the file system in the form of streams.
  11. HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware.
  12. Each block on a DataNode generates only one file in the local file system, namely the actual data file.
  13. In HDFS pipeline writes, the write process does not require the participation of the NameNode.
  14. Multiple NameNodes can exist simultaneously in a Zookeeper cluster
  15. When an external table in Hive is deleted, not only its metadata but also the data in the table will be deleted.

3. Short Answer Questions

  1. Briefly describe the characteristics of big data technology. (4V+)
  2. The Hadoop system is started with the bin/start-all.sh command. Please give the startup order of each process in the cluster (connect the processes with ->).
  3. Briefly describe the main technical features of HBase. (This question is relatively open; you can answer from the perspective of HBase's conceptual characteristics, the characteristics of HBase tables, etc.)
  4. In the Hive data warehouse, the following external table is created. Please give the corresponding HQL query statements.
    CREATE EXTERNAL TABLE sogou_ext (
    ts STRING, uid STRING, keyword STRING,
    rank INT, order INT, url STRING,
    year INT, month INT, day INT, hour INT
    ) COMMENT 'This is the sogou search data of extend data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/sogou_ext/20160508';
    (1) Give the HQL statement for the total number of distinct uids
    (2) For keyword, give the HQL statement for the 20 most frequent keywords (output each keyword and its frequency, in descending order of frequency)
  5. Briefly describe the Hadoop installation process; a general description is enough, with no need to list the specific steps.
  6. Briefly describe Hadoop's replica placement strategy.
    [In the actual exam, the short answer questions asked us to list the big data computing models and their representative products, and to name the representative product for each component of the Hadoop ecosystem and architecture. This was clearly not in the syllabus; the teacher took it from another source, and I wrote some of it down from vague memory.]

4. Programming Application Questions

  1. There are 1 million strings (spread across multiple files, with entries on each line separated by \t), some of which are duplicates. Remove all duplicates and keep only the unique strings. Give your design idea or core code using the MapReduce programming model.
  2. Multi-file sorting: there are multiple files, each consisting only of numbers no greater than 65535, with one number per line. Sort these numbers in descending order and divide them into three output files by value, i.e. split the range 0 to 65535 into three equal intervals, so that the final result is three files.
    (The questions here are relatively basic. The exam may also directly test WordCount, the inverted index, or TopN, so please review those yourself; a minimal WordCount sketch is included after this question list. WritableComparable-based custom sorting code is relatively lengthy and unlikely to be tested.)
  3. HBase shell commands:
    1) Create a student table stud with column families info and course, where the column family course is specified to keep 2 versions
    2) Insert a student with row key 101: into info, insert the name zhangsan and the age 19; into course, insert a hadoop grade of 91 and a java grade of 80 (four put commands in total)
    3) Change the number of versions kept for the course column family to 3
    4) Delete the entire row of data for row 101 of stud
    [HBase shell commands are certain to be tested; the result was exactly the same as in the lab.]
  4. [Real exam question]
    There are multiple files in a folder, such as a.txt, b.txt, c.txt... Each line in each file contains exactly one positive integer, and each file has several lines. Output the results to a single file, where each line contains a file name and the average of all numbers in that file, separated by a space. Example:
    a.txt 129.5
    b.txt 100.0
    c.txt 899.8
    To read the file name use:
    FileSplit inputSplit = (FileSplit)context.getInputSplit();
    String fileName = inputSplit.getPath().getName();
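
    (Review sketch referenced in the note under question 2: since WordCount may be tested directly, here is a minimal, self-contained WordCount example for reference. The local input and output paths are placeholders and should be adjusted to your own environment.)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {

    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Emit <word, 1> for every word on the line
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all counts for the same word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WcMapper.class);
        job.setCombinerClass(WcReducer.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder local paths for debugging; replace with your own directories
        FileInputFormat.setInputPaths(job, new Path("file:///D:\\mapreduce\\wordcount_input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///D:\\mapreduce\\wordcount_output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}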

Reference Answers

Multiple Choice Questions:
1-5 CABDB
6-10 DDAAB
11-15 ABCAC
16-20 CCBAB
21-25 ACBAB
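(A brief note on two of the calculation questions: for question 13, gzip is not splittable, so the 75MB gzip file becomes a single 75MB input split, while the indexed LZO file can be split into a 64MB and an 11MB split; for question 22, a 300MB file stored with a 128MB block size occupies ceil(300/128) = 3 blocks, since compression does not change how HDFS divides a file into blocks.)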
True/False Questions:
1-5 FFFFT 6-10 TTFFT 11-15 TFTTF
Short Answer Questions:
1. Large amount of data (Volume), various types (Variety), low value density (Value), fast speed and high timeliness (Velocity), plus variability (Variability) and authenticity (Veracity)
2. Startup sequence: namenode -> datanode -> secondarynamenode -> resourcemanager -> nodemanager
3. Column-oriented storage; strict consistency for reads and writes; holds massive amounts of data; data is automatically sharded; HBase has automatic failure detection and recovery for data failures; and it provides convenient integration with HDFS and MapReduce.

High reliability, high performance, scalability, real-time read and write

HBase characteristics:
Data types: strings only;
Data operations: simple operations only, with no complex associations or join operations;
Storage model: column-oriented storage;
Data maintenance: updates are inserts rather than in-place modify or delete operations;
Scalability: hardware can easily be added or removed.

Features of tables in HBase:
- Large: a table can have hundreds of millions of rows and millions of columns;
- Column-oriented: storage and access control are organized by column (family), and column (families) are retrieved independently;
- Sparse: empty (null) columns do not occupy storage space, so tables can be designed to be very sparse.

4. (1) select count(distinct(uid)) from sogou_ext;
(2) select keyword, count(*) as cnt from sogou_ext group by keyword order by cnt desc limit 20;

5. 1) Configure the host name and network, edit the hosts file, and reboot
2) Configure passwordless SSH login, turn off the firewall, and verify connections to the other machines
3) Unpack and install the JDK and Hadoop, and configure the Hadoop core configuration files
4) Copy the hadoop folder to the other nodes and configure the environment variables
5) Format HDFS: hadoop namenode -format
6) Start Hadoop: start-all.sh
(any roughly similar process is acceptable)
6. First replica: placed on the DataNode from which the file is uploaded; if the upload is submitted from outside the cluster, a node whose disk is not too full and whose CPU is not too busy is picked at random
Second replica: placed on a node in a different rack from the first replica
Third replica: placed on another node in the same rack as the first replica
Additional replicas: random nodes

Programming Application Questions:

  1. Idea: in the map stage, each string read from the files is emitted as the key; the value is irrelevant and can be set to NullWritable. In the reduce stage, no matter how many values a key has, the key is written out exactly once, so MapReduce itself performs the deduplication.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordDeduplicated {

    public static class DeMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split("\t");
            for (String word : words) {
                context.write(new Text(word), NullWritable.get());
            }
        }
    }

    public static class DeReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordDeduplicated.class);
        job.setMapperClass(DeMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setReducerClass(DeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("file:///D:\\mapreduce\\wordcount_input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///D:\\mapreduce\\deduplicated_output"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

  2. Idea: the key to multi-file sorting is the sort itself. MapReduce sorts keys in ascending order by default, so in the map stage each number is negated, and in the reduce stage it is negated back, which yields a descending order.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MultiFileNumberSort {

    public static class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            int number = Integer.parseInt(value.toString().trim());
            // The default sort order is ascending, so negate the number to obtain a descending result
            context.write(new IntWritable(-number), NullWritable.get());
        }
    }

    public static class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, IntWritable> {

        int lineNum = 1;

        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(new IntWritable(lineNum), new IntWritable(-key.get()));
            lineNum += 1;
        }
    }

    public static class SortPartitioner extends Partitioner<IntWritable, NullWritable> {

        @Override
        public int getPartition(IntWritable key, NullWritable value, int numPartitions) {
            int maxNumber = 65535;
            int bound = maxNumber / numPartitions + 1;
            int keyNumber = -key.get(); // undo the negation applied in the mapper
            for (int i = 0; i < numPartitions; i++) {
                if (keyNumber < bound * (i + 1) && keyNumber >= i * bound) return i;
            }
            return 0;
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MultiFileNumberSort.class);
        job.setMapperClass(SortMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setPartitionerClass(SortPartitioner.class);
        job.setNumReduceTasks(3);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("file:///D:\\mapreduce\\input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///D:\\mapreduce\\MultiFileNumberSortDesc_output"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

  3. 1) create 'stud','info',{NAME=>'course',VERSIONS=>2}
    2) put 'stud','101','info:name','zhangsan' //string
    put 'stud','101','info:age','19' //integer type
    put 'stud','101','course:hadoop','91' //numeric string
    put 'stud','101','course:java','80' //numeric string
    3) alter 'stud',{NAME=>'course',VERSIONS=>3}
    4) deleteall 'stud','101'
  4. MapReduce program for the per-file average (real exam question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class NumberAverage {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Use the input split to recover the name of the file this line came from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            String[] words = value.toString().trim().split(" ");
            for (String word : words) {
                context.write(new Text(fileName), new IntWritable(Integer.parseInt(word)));
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0, count = 0;
            for (IntWritable value : values) {
                sum += value.get();
                count += 1;
            }
            double average = (double) sum / count;
            context.write(key, new DoubleWritable(average));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(NumberAverage.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Local-mode paths for debugging; change them to your own (see my other Hadoop articles for the Maven setup)
        FileInputFormat.setInputPaths(job, new Path("file:///D:\\mapreduce\\input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///D:\\mapreduce\\NumberAverage_output"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

Summary

If there are any mistakes, please point them out in the comments.
I wish you a satisfying exam result!
