Hadoop review notes

Hadoop
Q1: There is a 1 TB file whose content is stored line by line, and exactly two of the lines are identical. Find these two lines.

  1. Split the file: hash each line and take the modulus against a file count chosen from the memory limit (roughly n files, hashCode % n). The remainder is the name of the small file the line is written to, so identical lines always land in the same small file;
  2. Traverse the contents of each small file; the two identical rows are guaranteed to be in the same file, so each file can be checked in memory (a sketch follows below);
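
A minimal sketch of this hash-partition idea, assuming the input is a plain text file; the file names, the bucket count n and the duplicate check are illustrative, not prescribed by the notes above:

```java
import java.io.*;
import java.util.*;

public class DuplicateLineFinder {
    public static void main(String[] args) throws IOException {
        int n = 1000; // bucket count chosen from the memory limit (and the OS file-handle limit)

        // Phase 1: scatter every line into bucket file hashCode % n,
        // so the two identical lines must land in the same bucket.
        BufferedWriter[] buckets = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            buckets[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("big_input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = Math.floorMod(line.hashCode(), n); // floorMod avoids negative indices
                buckets[b].write(line);
                buckets[b].newLine();
            }
        }
        for (BufferedWriter w : buckets) w.close();

        // Phase 2: each bucket now fits in memory, so a HashSet finds the duplicate.
        for (int i = 0; i < n; i++) {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("bucket_" + i + ".txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!seen.add(line)) {
                        System.out.println("Duplicate line: " + line);
                        return;
                    }
                }
            }
        }
    }
}
```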

Q2: Do a full ascending sort of an entire file of numbers.

Adopt a split-then-combine (external) sorting method:
Idea one

  1. Each time, take out a chunk of data small enough to fit in memory and sort it. The resulting small files are internally ordered, but their value ranges overlap (they are unordered between files).
  2. Finally, a merge algorithm combines the sorted files into one fully sorted output (see the sketch after this list);
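
A minimal sketch of the merge step in idea one, assuming the sorted runs are text files with one integer per line; the run count and file names are illustrative:

```java
import java.io.*;
import java.util.*;

public class KWayMerge {
    // One cursor per sorted run: the current value plus the reader it came from.
    static class Cursor {
        long value;
        BufferedReader reader;
    }

    public static void main(String[] args) throws IOException {
        int runs = 10; // number of sorted run files produced in the first pass

        // Min-heap keyed by the current value of each run.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.value));
        for (int i = 0; i < runs; i++) {
            BufferedReader r = new BufferedReader(new FileReader("run_" + i + ".txt"));
            String first = r.readLine();
            if (first != null) {
                Cursor c = new Cursor();
                c.value = Long.parseLong(first.trim());
                c.reader = r;
                heap.add(c);
            } else {
                r.close();
            }
        }

        try (BufferedWriter out = new BufferedWriter(new FileWriter("sorted.txt"))) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();            // smallest current value across all runs
                out.write(Long.toString(c.value));
                out.newLine();
                String next = c.reader.readLine();
                if (next != null) {                // advance this run and push it back
                    c.value = Long.parseLong(next.trim());
                    heap.add(c);
                } else {
                    c.reader.close();
                }
            }
        }
    }
}
```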

Idea two

  1. Assign a value range to each file in advance, traverse the numeric file, and write each number into the file whose range covers it. After this pass, the files are ordered between ranges but unordered internally;
  2. Sort each of these files separately and concatenate them in range order (see the sketch after this list);
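
A minimal sketch of idea two, assuming non-negative integers, one per line; the bucket width, bucket count and file names are illustrative:

```java
import java.io.*;
import java.util.*;

public class RangePartitionSort {
    public static void main(String[] args) throws IOException {
        long bucketWidth = 1_000_000L;   // each bucket covers [i*width, (i+1)*width)
        int buckets = 100;

        // Pass 1: scatter each number into the bucket that covers its range.
        BufferedWriter[] out = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            out[i] = new BufferedWriter(new FileWriter("range_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("numbers.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                long v = Long.parseLong(line.trim());
                int b = (int) Math.min(v / bucketWidth, buckets - 1);
                out[b].write(line.trim());
                out[b].newLine();
            }
        }
        for (BufferedWriter w : out) w.close();

        // Pass 2: sort each bucket in memory and append it to the final output.
        try (BufferedWriter sorted = new BufferedWriter(new FileWriter("sorted.txt"))) {
            for (int i = 0; i < buckets; i++) {
                List<Long> values = new ArrayList<>();
                try (BufferedReader in = new BufferedReader(new FileReader("range_" + i + ".txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) values.add(Long.parseLong(line.trim()));
                }
                Collections.sort(values);  // each bucket is small enough to sort in memory
                for (long v : values) {
                    sorted.write(Long.toString(v));
                    sorted.newLine();
                }
            }
        }
    }
}
```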

Q3: What is the role of clusters?

Split the file into pieces and run the computation on each piece in parallel.

Q4: What is the role of data migration?

The purpose of migration is to bring the two identical rows onto the same server; beyond that, the program is moved to the servers holding the data, so that operations run where the data lives rather than moving the data to the program.

Q5: Under what circumstances do memory and disk interact?

A SQL statement loads table data from disk into memory, modifies it in memory, and then writes it back to the table; that is one interaction. The interaction between memory and disk also becomes necessary when too much data is loaded: once memory is full, part of it overflows to disk to free space for new data. If that evicted data is later queried by a client, it has to be read back from disk, so memory and disk keep interacting.

Q6: Why does the NameNode do metadata persistence?

The NameNode is memory-based: the master node only maintains metadata and, for fast processing, handles everything in memory rather than exchanging with the disk. But because memory is lost on power failure, a persistence operation is needed (the metadata in memory is permanently stored on disk in the form of files).

Q7: Explanation of serialization and deserialization;

Serialization converts in-memory objects into binary byte streams (files). The advantage of the conversion is compatibility: the serialized form can be used across platforms, across nodes and across files. Deserialization turns the binary form (for example the fsimage) back into objects.
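
Hadoop's own serialization of data passed between nodes goes through the Writable interface; a minimal sketch of a custom record (the class name and fields are illustrative):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A value object that Hadoop can serialize to a binary stream and read back.
public class LineRecord implements Writable {
    private long lineNo;
    private String text;

    public LineRecord() { }                 // Writable needs a no-arg constructor

    public LineRecord(long lineNo, String text) {
        this.lineNo = lineNo;
        this.text = text;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeLong(lineNo);
        out.writeUTF(text);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        lineNo = in.readLong();
        text = in.readUTF();
    }
}
```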

Q8: The startup process of the Hadoop cluster.

When the Hadoop cluster is started, the fsimage file is read first and an edits log file is generated (both are empty on the very first start). edits and fsimage are merged into a new fsimage; this fsimage persists no matter how many times the cluster is started, while the edits log keeps growing. When the edits log becomes too large, recovering from the fsimage is faster, so the fsimage is used as the recovery baseline and a new merge job is produced;

Q9: When Hadoop is restored, fsimage is used as the basis, and fsimage and edits are merged. How to merge?

The SecondaryNameNode (SNN) is introduced to be responsible for the merge (Hadoop 1.x).

Q10: What is the role of NameNode?
The NameNode maintains the metadata of the block files stored in the cluster. The slave nodes maintain and manage their own block information: each slave node runs the DataNode role, and each DataNode is only responsible for maintaining and managing the block files placed on its own node.

Q11: Master-slave architecture?
Master-slave architecture: the master node manages the file metadata (MetaData), and the slave nodes store and process the actual file data, i.e. the block files. The master role maintains the integrity of the metadata; the slave role maintains the resource information of its own node and actually operates on the data blocks placed on that node.

Q12: HDFS read and write process?

  1. The writing process mainly solves how data is cut into blocks and uploaded to different nodes; the reading process solves how the different blocks of a file are read back from the different servers in the cluster.
  2. Writing process:
    2-1 The client creates an output stream (for writing the file) through a DistributedFileSystem object; creating the write path makes the client interact with the NameNode according to the location the path points to;
    2-2 The client splits the file into 128 MB blocks. After splitting, the client asks the NameNode for the upload location of the first block; each block has three replicas, and the three replicas are distributed to different nodes;
    2-3 The client only contacts the first node: it uploads the first block to dn1, and once dn1 has received it, dn1 forwards it to dn2 (pipeline streaming). When all blocks have been uploaded, the client reports to the NameNode that the transfer is complete;
  3. Reading process:
    3-1 The client asks the NameNode for the list of block replica locations;
    3-2 The client reads the blocks from the DataNodes linearly and finally merges them into one file;
    3-3 For each block, the nearest replica in the replica list is selected;
    3-4 MD5 is used to verify data integrity (a client-side sketch of both processes follows this list);
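
A minimal client-side sketch of writing and reading a file with the HDFS Java API; the NameNode address and the paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // NameNode address (illustrative)
        FileSystem fs = FileSystem.get(conf);

        // Write: behind the scenes the client asks the NameNode for block locations,
        // then streams the data to the DataNode pipeline.
        try (FileInputStream local = new FileInputStream("local.txt");
             FSDataOutputStream out = fs.create(new Path("/data/remote.txt"))) {
            IOUtils.copyBytes(local, out, 4096, false);
        }

        // Read: the client gets the block replica list from the NameNode
        // and pulls each block from the nearest DataNode.
        try (FSDataInputStream in = fs.open(new Path("/data/remote.txt"));
             FileOutputStream local = new FileOutputStream("copy.txt")) {
            IOUtils.copyBytes(in, local, 4096, false);
        }

        fs.close();
    }
}
```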

Q13: What happens in the format stage?
Formatting produces the fsimage image file and the current cluster ID.

Q14: In which files is the configuration information of the master and slave nodes specified?

  1. core-site.xml specifies the configuration of the main role process, the NameNode;
  2. hdfs-site.xml configures the slave-node related settings, such as the number of replicas;
  3. The slaves file lists the slave nodes (a sketch of reading these settings follows below);
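
A minimal sketch of how a client sees these settings: a Hadoop Configuration picks up core-site.xml from the classpath (hdfs-site.xml is added once the HDFS client classes are loaded), and the property names shown are the standard keys:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath;
        // hdfs-site.xml is picked up when HDFS client classes register it.
        Configuration conf = new Configuration();

        // From core-site.xml: the NameNode (master role) address.
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));

        // From hdfs-site.xml: the replication factor for block copies (default 3).
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}
```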

Q15: Is it necessary for distributed computing to move to the data?
Since the data is huge, it is not cost-effective to move the data to the program, so the opposite is done: the program is distributed to where the data resides and the computation runs there. This is the principle that moving computation is cheaper than moving data.

Q16: NameNode persistence

1. Hadoop file writing process?
2. How much do you know about HDFS? Reading and writing process, structure?
3. In the three-replica write process, what happens if one of the replicas fails to be written?
4. HDFS HA (process, startup process)?
5. Hadoop optimization?
6. The difference between Hadoop 1 and Hadoop 2?
7. What are the components of Hadoop?
8. Hadoop data skew problem?
9. What type of storage does HDFS use?
10. The difference between Hadoop 1.x and 2.x?
11. Tell me what is configured in each configuration file in Hadoop?
12. Hadoop cluster optimization?
13. How is HDFS implemented?
14. HDFS file creation workflow?
15. HDFS asynchronous reading? ------------- Reference blog: http://blog.csdn.net/androidlushangderen/article/details/52452215
16. HDFS API source-code answer: file creation workflow?
17. What is new in the HDFS API? --------- do not know
18. After a job in Hadoop is submitted to the ResourceManager, what kind of container does the ResourceManager generate to hold this job?
19. A block in the Hadoop cluster cannot copy its data to other nodes, what should I do? If there is a large amount of concurrency and multiple blocks cannot copy data, what should I do? ------- do not know
20. How does ZooKeeper achieve high availability for Hadoop? -------- not familiar with it
21. The Hadoop ecosystem?
22. The MapReduce (MR) process?
