Big Data Platform Core Technology A: Knowledge Summary

Big data knowledge summary

Chapter 3 Distributed File System HDFS

1. A distributed file system has two kinds of nodes: master nodes and slave nodes.

The master node (Master Node), also known as the name node (NameNode), is responsible for creating, deleting, and renaming files and directories; the slave node (Slave Node), also known as the data node (DataNode), is responsible for storing and reading data. Data nodes also create, delete, and replicate data blocks according to commands from the name node.

2. Limitations of HDFS:

  1). Not suitable for low-latency data access; 2). Cannot efficiently store large numbers of small files; 3). Does not support multiple users writing to the same file or modifying files arbitrarily (files can be appended to or deleted, but not modified in place); 4). Not suitable for real-time transaction systems.

3. HDFS also uses the concept of blocks; the default block size is 64 MB (128 MB in Hadoop 2.x and later).

4. The obvious benefits of HDFS adopting the block concept:

1). Supports large-scale file storage (a file is split into blocks that can be distributed across many nodes); 2). Simplifies system design (metadata does not need to be stored together with the file blocks); 3). Suitable for data backup (each block can be replicated independently).
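As a small illustration of the block abstraction, here is a minimal Java sketch (assuming a reachable HDFS cluster configured through core-site.xml; the path /data/example.txt is hypothetical) that uses the Hadoop FileSystem API to ask the name node which blocks a file consists of and which data nodes hold each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the cluster's name node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation describes one block and the data nodes holding its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```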

5. The name node has two core data structures: FsImage and EditLog

FsImage is used to maintain the file system tree and the metadata of all files and folders in the file tree;

EditLog (log file) records all file creation, deletion, renaming and other operations.

6. The secondary name node (SecondaryNameNode) is an important part of the HDFS architecture. It has two functions:

  1). It periodically merges the EditLog into the FsImage, which keeps the EditLog file small and shortens the name node's restart time; 2). It can serve as a "checkpoint" of the name node, saving the metadata held by the name node.

7. HDFS clients: shell, Python, web page.
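A minimal sketch of an HDFS client written against the Java FileSystem API (the name node address hdfs://localhost:9000 and the path /tmp/hello.txt are assumptions for the example): it writes a small file, requests three replicas, and reads the file back. The client obtains metadata from the name node, while the file contents are streamed to and from the data nodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed name node address
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt"); // hypothetical path

        // Write: the name node picks target data nodes, the client streams data to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Ask for 3 replicas of the file (the usual default replication factor).
        fs.setReplication(path, (short) 3);

        // Read: the name node returns block locations, the bytes come from the data nodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```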

8. HDFS’s communication protocols are all built on TCP/IP.

9. The interaction between the client and the data nodes is implemented through remote procedure calls (RPC). The name node never actively initiates an RPC; it only responds to RPC requests from clients and data nodes.

10. Limitations of the HDFS architecture: see textbook p. 54.

11. HDFS has only a single name node, so if the name node fails, the whole file system becomes unavailable (a single point of failure).

12. HDFS keeps three copies of each data block by default, so it has high fault tolerance.

13. Using multiple replicas can speed up data transfer, because different clients can read the same block from different data nodes in parallel. (√)

14. Each data node regularly sends "heartbeat" messages to the name node, so the name node can tell whether that data node is still available.

Chapter 4 Distributed Database HBase

1. HBase originated from Google's BigTable paper.

2. HBase is designed for PB-scale data.

3. HBase stores its data in HDFS, using HDFS as its highly reliable underlying storage layer.

4. HBase is a column-oriented database that mainly stores loosely structured data (unstructured and semi-structured), but it can also store structured data.

5. The characteristics of HBase are: high reliability, high performance, column orientation, and scalability.

6. HBase uses sparse storage.

7. Comparison between HBase and traditional relational databases:

1). Data types: relational databases use the relational data model and have rich data types and storage methods; HBase uses simple data types and stores all data as uninterpreted strings;

2). Data operations: relational databases support insert, delete, update, and query as well as multi-table queries (joins); HBase has no complex relationships between tables and supports only simple insert, delete, update, and query operations;

3). Storage model: relational databases use row-oriented storage, while HBase uses column-oriented storage;

4). Data indexing: relational databases can build multiple complex indexes on different columns, while HBase has only one index, the row key;

5). Data maintenance: in a relational database, an update overwrites the old value, which is then gone; HBase does not delete the old version but keeps it as an earlier timestamped version (see the sketch after this list);

6). Scalability: MySQL limits a table to at most 1024 fields (columns), while HBase has no such limit.
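To make point 5) concrete, here is a hedged Java sketch of HBase's multi-version behaviour (the table student and column family info are hypothetical, and the column family must be configured to keep more than one version): writing a new value to the same cell does not delete the old one; it adds a newer version under a later timestamp.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("student"))) { // hypothetical table

            // Two writes to the same cell: the old value is kept as an older version.
            table.put(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes("80")));
            table.put(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes("95")));

            // Ask for up to 3 versions of the cell.
            Get get = new Get(Bytes.toBytes("row1"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"));
            get.setMaxVersions(3);

            List<Cell> cells = table.get(get)
                    .getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("score"));
            for (Cell cell : cells) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```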

8. HBase clients: shell, Java, Python, web page, etc.

9. HBase columns and column families can be added dynamically, which is convenient and flexible and supports dynamic expansion of the schema.

10. HBase locates a piece of data by "four-dimensional coordinates": row key, column family, column qualifier, and timestamp.
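A minimal Java sketch of addressing one cell by these four coordinates (the table student, column family info, and qualifier name are hypothetical; ZooKeeper is assumed to run on localhost). When the timestamp is omitted, HBase returns the newest version of the cell.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        // The client locates Regions through ZooKeeper, not through the Master.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // assumed ZooKeeper address

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("student"))) { // hypothetical table

            // Write one cell: row key + column family + column qualifier (timestamp is implicit).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by its coordinates (newest timestamp by default).
            Get get = new Get(Bytes.toBytes("row1"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```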

11. The Master in HBase is not involved in normal data access. If the Master goes down after HBase has started, data can still be read and written, but administrative operations such as creating tables can no longer be performed.

12. The implementation of HBase includes three major components: library functions, a Master server, and many Region servers. (Tested at the end of the semester)

Library functions: linked into every client;

Master server: responsible for managing and maintaining the partition (Region) information of HBase tables;

Region servers: responsible for storing and maintaining the Regions assigned to them, and for handling read and write requests from clients.

13. The HBase client does not rely on the Master; it relies on ZooKeeper (a distributed coordination service) to obtain Region location information.

14. In 2006 the default size of each Region was 100 MB to 200 MB; currently it is around 2 GB.

15. The names and functions of each level in HBase's three-tier addressing structure: the ZooKeeper file records the location of the -ROOT- table; the -ROOT- table records the locations of the Regions of the .META. table; the .META. table records the locations of the Regions of the user data tables.

16. To speed up access, all Regions of the .META. table are kept in memory.

17. The ZooKeeper servers help elect one Master as the cluster manager and ensure that exactly one Master server is running at any time, which avoids the "single point of failure" problem for the Master.

Chapter 7 MapReduce

1. MapReduce cannot split data whose records depend on one another (dependencies in time or space); only datasets that can be divided into independent pieces are suited to MapReduce processing.

2. The core idea of MapReduce is "divide and conquer", and its design principle is "move the computation close to the data". (Tested at the end of the semester)

3. MapReduce abstracts complex parallel computations running on large-scale clusters into two functions: Map (mapping) and Reduce (reduction).

4. The number of input splits determines the number of Map tasks, while the number of Reduce tasks depends on the number of available reduce slots.

5. Shuffle refers to partitioning, sorting, and merging the output of the Map tasks (the merging must not change the final result; tested at the end of the semester).

6. The input to Map uses Hadoop's default <key, value> format, and the Map output is intermediate data that has not yet been shuffled.

7. The input and output process of Map and Reduce. (Tested at the end of the semester)
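The standard WordCount job is a compact way to see these input and output steps end to end. The sketch below follows the canonical Hadoop example (input and output paths come from the command line): Map turns each line into <word, 1> pairs, shuffle groups the pairs by word, and Reduce sums the counts; the combiner reuses the reducer because local merging does not change the final result.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: input <offset, line>, output <word, 1>.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after shuffle, input <word, list of counts>, output <word, total>.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local merging must not change the result
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1); // the number of Reduce tasks is set for the job, not by splits
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```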

Chapter 9 Data Warehouse Hive

1. The difference between internal tables and external tables in Hive database: (tested at the end of the semester)

The key points to answer are: who manages the data and who creates the table, and what happens to the data when the table is deleted.

Answer: for an internal (managed) table, the data files, metadata, and statistics are managed by Hive and stored under the directory configured by hive.metastore.warehouse.dir; when an internal table or one of its partitions is dropped, the corresponding data and metadata are deleted as well;

An external table can specify its own LOCATION, and its data can be used without going through Hive; when an external table or one of its partitions is dropped, the data files still exist.
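As a small sketch of the difference (assuming a HiveServer2 instance at jdbc:hive2://localhost:10000 with the Hive JDBC driver on the classpath; the table names and the HDFS directory /data/logs/ are hypothetical), the following Java snippet creates a managed table and an external table and then drops the external one; only its metadata disappears, the files under /data/logs/ remain.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a local HiveServer2; adjust the JDBC URL and credentials for your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // Internal (managed) table: data lives under hive.metastore.warehouse.dir,
            // and DROP TABLE removes both the metadata and the data files.
            stmt.execute("CREATE TABLE IF NOT EXISTS logs_managed (line STRING)");

            // External table: LOCATION points at a directory the user manages;
            // DROP TABLE removes only the metadata, the files in HDFS remain.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS logs_external (line STRING) "
                    + "LOCATION '/data/logs/'"); // hypothetical HDFS directory

            stmt.execute("DROP TABLE logs_external"); // /data/logs/ is left untouched
        }
    }
}
```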

2. A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions. (Required for the end-of-semester exam)

3. Hive has no indexes.

4. Hive is a data warehouse tool built on Hadoop.

5. Hive defines a simple SQL-like query language, HiveQL, which is compatible with about 95% of SQL.

6. Hive relies on HDFS to store data and MapReduce to process data.

7. Users can run MapReduce tasks by writing HiveQL.

8. Hive is an analysis tool that can organize and use data effectively, reasonably, and intuitively.

9. Hive’s three core components: user interface, driver module, and metadata storage module.

The driver module includes: a parser (checks whether the SQL statement is correct), a compiler (translates HiveQL into MapReduce tasks rather than ordinary program code), an optimizer, and an executor;

Metadata storage module (metastore): an independent relational database.

10. When creating an external table in Hive, its location must be a directory path and cannot point to a single file, because Hive assumes the directory may contain multiple data files.
