Big Data Practical Linux Interview Questions

1. What is cluster technology? What are the advantages of cluster technology?

 (1) Cluster technology can be defined as follows: a group of mutually independent servers is presented on the network as a single system and managed as a single system. This single system provides highly available services to client workstations. In most models, all computers in the cluster share a common name, and a service running on any system in the cluster can be used by all network clients. A cluster must be able to coordinate and manage the errors and failures of its separate components, and components must be able to be added to the cluster transparently.

A cluster contains multiple (at least two) servers with shared data storage space. When any server runs an application, the application data is stored in the shared data space, while each server's operating system and application files are kept in its own local storage. The node servers in the cluster communicate with one another over an internal LAN.

When a node server fails, the applications running on it are automatically taken over by another node server. When an application service fails, the service is restarted or taken over by another server. In either case, clients can quickly reconnect to the new application service.

(2) Advantages: high scalability; high availability (HA): if a node in the cluster fails, its tasks can be passed to other nodes, effectively preventing a single point of failure;

high performance: a load-balancing cluster allows the system to serve more users concurrently;

high cost-effectiveness: a high-performance system can be built from inexpensive hardware that meets industry standards.
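The failover behavior described above can be sketched with a toy model (the class and node names here are illustrative, not a real cluster API): services are addressed by name, and when a node fails its services are reassigned to a surviving node.

```python
# Toy model of cluster failover: when a node fails, its services are
# taken over by a surviving node, so clients keep a single entry point.
class Cluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)     # healthy node names
        self.services = {}          # service -> node currently running it

    def deploy(self, service, node):
        self.services[service] = node

    def fail_node(self, node):
        """Simulate a node failure: reassign its services to a survivor."""
        self.nodes.discard(node)
        survivor = min(self.nodes)  # trivial placement policy
        for svc, owner in self.services.items():
            if owner == node:
                self.services[svc] = survivor

    def locate(self, service):
        """Clients resolve a service name, not a specific node."""
        return self.services[service]

cluster = Cluster(["node1", "node2"])
cluster.deploy("web", "node1")
cluster.fail_node("node1")
print(cluster.locate("web"))  # -> node2
```

A real cluster manager would detect failures via heartbeats and use a smarter placement policy, but the client-facing behavior is the same: the service name keeps resolving after the failover.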

 

 

2. What is NTP? What is the use?

(1) The full name of NTP is "Network Time Protocol". It is a time synchronization protocol defined in RFC 1305, used to synchronize time between distributed time servers and clients.

NTP is transmitted over UDP, using port 123.

The purpose of using NTP is to synchronize the clocks of all clock-bearing devices in the network so that they are consistent, allowing those devices to provide multiple applications on a unified time base. A local system running NTP can not only be synchronized from other clock sources but can also act as a clock source to synchronize other systems, and it can synchronize mutually with other devices.

(2) An NTP server is mainly used to synchronize the time of the computers in a network. Its purpose is to synchronize computer clocks to UTC, with an accuracy of about 0.1 ms on a local area network and 1-50 ms across most of the Internet. It can synchronize a computer's time to its server or to a clock source (such as a quartz clock or GPS), it provides high-precision time correction, and it can use cryptographic authentication to prevent malicious protocol attacks.
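The core of the synchronization is a small calculation over the four timestamps of one request/response exchange; the sketch below shows the standard NTP offset and delay formulas (the timestamp values are made up for illustration).

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP calculation from one request/response exchange.

    t1: client sends the request  (client clock)
    t2: server receives it        (server clock)
    t3: server sends the reply    (server clock)
    t4: client receives the reply (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far the client clock lags
    delay = (t4 - t1) - (t3 - t2)         # round-trip network delay
    return offset, delay

# Example: the client clock is 5 s slow; each network leg takes 0.1 s.
offset, delay = ntp_offset_delay(100.0, 105.1, 105.2, 100.3)
print(round(offset, 3), round(delay, 3))  # -> 5.0 0.2
```

Note that the symmetric-delay assumption is what lets NTP cancel the network latency out of the offset estimate; asymmetric routes are the main source of the 1-50 ms Internet error mentioned above.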

3. What is SSH? What is the use?

(1) SSH is the abbreviation of Secure Shell, a protocol formulated by the IETF Network Working Group. SSH is a security protocol built on the application layer, designed to provide security for remote login sessions and other network services. Using the SSH protocol can effectively prevent information leakage during remote administration. SSH was originally a program on UNIX systems and quickly spread to other operating platforms. Used correctly, SSH can compensate for vulnerabilities in the network.

(2) Transport layer protocol [SSH-TRANS]

Provides server authentication, confidentiality, and integrity, and sometimes also compression. SSH-TRANS usually runs over a TCP/IP connection but may also be used over other reliable data streams. It provides strong encryption, cryptographic host authentication, and integrity protection. Authentication in this protocol is host-based; the protocol does not perform user authentication. A higher-level user authentication protocol can be designed on top of it.

 

User authentication protocol [SSH-USERAUTH]

It provides client-to-server user authentication and runs on top of the transport layer protocol SSH-TRANS. When SSH-USERAUTH starts, it receives the session identifier from the lower-level protocol (the exchange hash H from the first key exchange). The session identifier uniquely identifies the session and is suitable for signing to prove ownership of a private key. SSH-USERAUTH also needs to know whether the lower-level protocol provides confidentiality protection.

 

Connection protocol [SSH-CONNECT]

Multiplexes the encrypted tunnel into multiple logical channels. It runs on top of the user authentication protocol and provides interactive login sessions, remote command execution, forwarded TCP/IP connections, and forwarded X11 connections.

 

4. Briefly describe the certificate login process.

First, the client generates a key pair: the private key stays on the client, and the public key is uploaded to the (remote login) server. Generally, for security, to guard against an attacker who copies the client's private key, a passphrase is set when the private key is generated, and the client must enter this passphrase to unlock the private key each time it logs in to the SSH server. (If you use a passphrase-less private key at work and the server is hacked one day, you will have no way to clear yourself of suspicion.)

Then, the server trusts the public key: the public key generated by the client is uploaded to the SSH server and appended to the designated file (typically ~/.ssh/authorized_keys). With that, SSH certificate login is configured.

5. What is the design philosophy of HDFS?

Stores very large files; "very large" here means files of hundreds of MB, GB, or even TB. The most efficient access pattern is write-once, read-many (streaming data access): a data set stored in HDFS serves as the object of analysis for Hadoop, and after the data set is generated, various analyses are run against it over a long period. Each analysis involves most or even all of the data in the data set, so the latency of reading the entire data set matters more than the latency of reading the first record. It runs on ordinary, inexpensive servers: one of HDFS's design ideas is to run on commodity hardware, using fault-tolerance strategies to keep data highly available even when hardware fails.
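The "large files on cheap hardware" idea comes down to splitting each file into large fixed-size blocks and replicating every block on several datanodes. This is a toy sketch, not the real HDFS API; the block size, replication factor, and round-robin placement are simplified stand-ins (real HDFS defaults to 128 MB blocks, 3 replicas, and rack-aware placement).

```python
# Toy sketch of HDFS-style block storage: a file is split into large,
# fixed-size blocks, and each block is replicated on several datanodes.
BLOCK_SIZE = 4      # bytes here, purely for illustration
REPLICATION = 2

def place_blocks(data, datanodes):
    """Return {block_id: (block_bytes, [datanodes holding a replica])}."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # Trivial round-robin placement; real HDFS is rack-aware.
        replicas = [datanodes[(block_id + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        placement[block_id] = (block, replicas)
    return placement

layout = place_blocks(b"abcdefghij", ["dn1", "dn2", "dn3"])
print(len(layout))    # -> 3 (the 10-byte "file" becomes 3 blocks)
print(layout[0][1])   # -> ['dn1', 'dn2']
```

Replication is what makes cheap, failure-prone hardware acceptable: losing any single datanode leaves at least one replica of every block readable.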

6. Briefly describe the architecture of YARN.

Yarn still belongs to the master/slave model as a whole, and mainly relies on three components to achieve functions.

The first is the ResourceManager, the arbiter of cluster resources. It consists of two parts: a pluggable Scheduler, and the ApplicationsManager, which manages the user jobs submitted to the cluster.

The second is the NodeManager on each node, which manages the user jobs and workflows on that node and continuously reports its Container usage to the ResourceManager.

The third component is the ApplicationMaster, the manager of a user job's life cycle. Its main function is to request computing resources (Containers) from the (global) ResourceManager and to interact with NodeManagers to execute and monitor specific tasks.
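The interaction between the three components can be sketched as a toy simulation (the class names mirror the YARN roles, but the allocation logic is deliberately trivial and not the real scheduler): the ApplicationMaster asks the ResourceManager for containers, and each task then runs on some NodeManager's node.

```python
# Toy model of the YARN flow: an ApplicationMaster requests containers
# from the ResourceManager, then tasks run on NodeManagers' nodes.
class ResourceManager:
    def __init__(self, node_capacity):
        self.free = dict(node_capacity)  # node -> free container slots

    def allocate(self, num_containers):
        """Scheduler: grant containers from nodes with free capacity."""
        granted = []
        for node, free in self.free.items():
            while free > 0 and len(granted) < num_containers:
                granted.append(node)
                free -= 1
            self.free[node] = free
        return granted  # one node name per granted container

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, tasks):
        containers = self.rm.allocate(len(tasks))
        # Each task runs in a container on some NodeManager's node.
        return {task: node for task, node in zip(tasks, containers)}

rm = ResourceManager({"nm1": 2, "nm2": 1})
am = ApplicationMaster(rm)
result = am.run_job(["map1", "map2", "reduce1"])
print(result)  # -> {'map1': 'nm1', 'map2': 'nm1', 'reduce1': 'nm2'}
```

The key architectural point this illustrates: resource arbitration (ResourceManager) is separated from per-job management (ApplicationMaster), so the global scheduler never tracks individual tasks.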

 

7. What are the characteristics of HBase?

First, strong consistency reads and writes. HBase is not an eventually consistent data store, which makes it suitable for high-speed counting and aggregation tasks.

Second, automatic sharding. HBase tables are distributed in the cluster through regions. When the data grows, the region will be automatically divided and redistributed.

Third, the RegionServer automatically fails over.

Fourth, Hadoop/HDFS integration. HBase supports external HDFS as its distributed file system.

Fifth, MapReduce integration. HBase supports large concurrent processing through MapReduce, and HBase can act as a source and sink at the same time.

Sixth, the Java client API. HBase supports easy-to-use Java API for programmatic access.

Seventh, Thrift/REST APIs. HBase supports access via Thrift and REST as well.

Eighth, Block Cache and Bloom Filter. HBase supports Block Cache and Bloom filters for query optimization to improve query performance.

Ninth, operations management. HBase provides a built-in web UI and JMX metrics for operations and monitoring.

Tenth, big tables (BigTable-style). A table can have hundreds of millions of rows and millions of columns.

Eleventh, column-family-based storage, retrieval, and permission control.

Twelfth, sparseness. Null columns in the table do not occupy storage space.

 

8. How does HBase store data?

(1) The HRegion is the smallest unit in which HBase stores and distributes data. A Table can have one or more Regions, which may reside on the same HRegionServer or be distributed across different HRegionServers. One HRegionServer can host multiple HRegions belonging to different Tables. An HRegion is composed of multiple Stores, and each Store corresponds to one Column Family of the Table in that HRegion; that is, each Column Family is a centralized storage unit.

(2) The Store is the core of HBase storage; it implements reading from and writing to HDFS. A Store consists of one MemStore and zero or more StoreFiles.

(3) The MemStore is an in-memory sorted write buffer. All data is written to the WAL log first and then into the MemStore; the MemStore flushes its data to an underlying HDFS file (HFile) according to certain criteria (typically when it reaches a size threshold). Each Column Family in an HRegion has its own MemStore.

(4) The HFile stores HBase data (Cells/KeyValues); a StoreFile is a thin wrapper around an HFile, i.e., the underlying format of a StoreFile is the HFile. Data in an HFile is sorted by RowKey, Column Family, and Column; versions of the same Cell (i.e., entries for which those three values are identical) are sorted by timestamp in descending order.
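That sort order can be expressed directly as a comparison key; the sketch below models a Cell as a simple tuple (illustrative, not HBase's actual byte-level comparator) and negates the timestamp so the newest version of a cell comes first.

```python
# Sketch of the HFile KeyValue sort order: ascending by (rowkey, column
# family, column qualifier), then DESCENDING by timestamp, so the newest
# version of the same cell sorts first.
def keyvalue_sort_key(cell):
    rowkey, family, qualifier, timestamp = cell
    return (rowkey, family, qualifier, -timestamp)

cells = [
    ("row1", "cf", "a", 100),
    ("row1", "cf", "a", 300),  # newer version of the same cell
    ("row1", "cf", "b", 200),
    ("row0", "cf", "a", 100),
]
ordered = sorted(cells, key=keyvalue_sort_key)
for cell in ordered:
    print(cell)
# row0 sorts first; within row1/cf/a, ts=300 precedes ts=100
```

Sorting newest-first is what lets a read scan stop at the first matching KeyValue when only the latest version is wanted.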

(5) The WAL is the Write-Ahead Log, called the HLog in earlier versions. It is a file on HDFS. As the name implies, every write operation first ensures the data is written to this log file before the MemStore is actually updated, and the data is finally written into an HFile. WAL files are stored in the directory /hbase/WALs/${HRegionServer_Name}.

There is also the BlockCache: a read cache. Newly queried data is cached in the BlockCache.

HBase's LSM storage idea

The core idea of the LSM tree (Log-Structured Merge Tree) is to assume that memory is large enough, so data need not be written to disk on every update; instead, the modifications are batched in memory and written to disk once they reach a specified size limit.

LSM simple model
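A minimal sketch of this LSM-style write path, combining the WAL, MemStore, and HFile roles described above (a toy model with made-up names and a tiny flush threshold, not HBase's actual implementation):

```python
# Minimal LSM-style write path: writes go to a log (WAL) and an in-memory
# buffer (MemStore); when the buffer reaches a size limit it is flushed
# as one sorted, immutable file (an "HFile").
class LsmStore:
    def __init__(self, flush_limit=3):
        self.wal = []             # write-ahead log: survives a crash
        self.memstore = {}        # in-memory buffer, sorted at flush time
        self.hfiles = []          # flushed, immutable sorted files
        self.flush_limit = flush_limit

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log first
        self.memstore[key] = value      # 2. then buffer in memory
        if len(self.memstore) >= self.flush_limit:
            self.flush()                # 3. batch-write when full

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:             # newest data wins
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # then newest flushed file
            for k, v in hfile:
                if k == key:
                    return v
        return None

store = LsmStore(flush_limit=2)
store.put("r2", "x")
store.put("r1", "y")      # second put triggers a flush of a sorted batch
store.put("r3", "z")
print(store.hfiles[0])    # -> [('r1', 'y'), ('r2', 'x')]
print(store.get("r3"))    # -> z (still in the MemStore)
```

Real LSM stores also compact the flushed files in the background (merging and deduplicating them), which this sketch omits; that is exactly what HBase's minor and major compactions do.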

 

9. What are the main functions of HRegionServer?

The RegionServer is the container that holds Regions; intuitively, it is a service running on a server. The RegionServer is the node that actually stores data, which is ultimately stored in the distributed file system HDFS. After a client obtains the RegionServer's address from ZooKeeper, it fetches data directly from the RegionServer. For an HBase cluster, the RegionServer is even more important than the Master service.

10. What is the role of Zookeeper?

RegionServers rely heavily on the ZooKeeper service, and ZooKeeper plays a role in HBase similar to a housekeeper. ZooKeeper manages the information of all RegionServers in HBase, including which RegionServer a given segment of data is stored on. Each time a client connects to HBase, it actually communicates with ZooKeeper first, queries which RegionServer it needs to connect to, and then connects to that RegionServer.

A brief summary of ZooKeeper's role in an HBase cluster: for the servers, it is an important dependency for cluster coordination and control; for the clients, it is an indispensable part of querying and operating on data.

It should be noted that when the Master service is down, read and write operations can still be performed; but once ZooKeeper is down, data can no longer be read, because the location of the metadata table hbase:meta, which is required for reads, is stored on ZooKeeper. Clearly, ZooKeeper is essential to HBase.
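The read path just described can be sketched as a two-step lookup (a toy model with made-up data; real clients also cache the meta location and the protocol is RPC-based, not a dict lookup):

```python
# Toy model of the HBase read path: the client first asks ZooKeeper where
# hbase:meta lives, then talks to RegionServers directly.
zookeeper = {"hbase:meta-location": "rs1"}     # what ZooKeeper serves
regionservers = {
    "rs1": {"hbase:meta": {"rowX": "rs2"}},    # meta maps row -> RegionServer
    "rs2": {"user-table": {"rowX": "value42"}},
}

def get(row):
    meta_rs = zookeeper["hbase:meta-location"]           # 1. ask ZooKeeper
    data_rs = regionservers[meta_rs]["hbase:meta"][row]  # 2. read hbase:meta
    return regionservers[data_rs]["user-table"][row]     # 3. read the data

print(get("rowX"))  # -> value42
```

Note the Master never appears in this path, which is why reads and writes survive a Master outage but not a ZooKeeper outage.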

11. What are the characteristics of Hive?

First, analysis and summarization of massive structured data.

Second, it simplifies complicated MapReduce programming tasks into SQL statements.

Third, flexible data storage formats, supporting formats such as TEXTFILE, SEQUENCEFILE, RCFILE, and ORC (Optimized Row Columnar), as well as data in JSON and CSV form.

 

12. What is Metastore? What does it do? What does it contain?

The Hive Metastore (HMS) is a separate service that does not have to be deployed together with Hive and need not even be on the same cluster. HMS stores metadata on the back end for Hive, Impala, Spark, and other components.

Metadata is data about data, mainly information describing the attributes of data. It supports functions such as recording storage locations, historical data, resource lookup, and file records. Metadata can be regarded as an electronic catalog: to compile the catalog, the content or characteristics of the data must be described and collected, thereby assisting data retrieval.

The metastore is a metadata service. Its role: clients connect to the metastore service, and the metastore connects to a MySQL database to access the metadata. With the metastore service, multiple clients can connect at the same time, and these clients do not need to know the MySQL database's username and password; they only need to be able to connect to the metastore service.

The metastore database contains tables such as DBS, TBLS, PARTITIONS, and SDS.

13. What is the difference between an internal table and an external table?

(1) A table created with the external keyword is an external table; otherwise it is an internal table.

(2) Both internal and external tables can have their own location specified at creation time.

(3) When a table is dropped, an external table's data is not deleted (only the metadata is removed), while an internal table's data is deleted along with the metadata.

(4) Other usage is the same.
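Point (3) is the difference that matters in practice, and it can be modeled in a few lines (a toy model with hypothetical names, not Hive's implementation): the metastore entry is always removed on drop, but the data files are removed only for a managed (internal) table.

```python
# Toy model of Hive drop semantics: dropping an internal (managed) table
# removes metadata AND data; dropping an external table removes only the
# metadata, leaving the underlying files in place.
class Warehouse:
    def __init__(self):
        self.metastore = {}  # table -> {"external": bool, "location": path}
        self.storage = {}    # path -> file contents

    def create_table(self, name, location, data, external=False):
        self.metastore[name] = {"external": external, "location": location}
        self.storage[location] = data

    def drop_table(self, name):
        info = self.metastore.pop(name)         # metadata always removed
        if not info["external"]:
            del self.storage[info["location"]]  # data removed only if managed

wh = Warehouse()
wh.create_table("logs", "/data/logs", ["a", "b"], external=True)
wh.create_table("tmp", "/data/tmp", ["c"], external=False)
wh.drop_table("logs")
wh.drop_table("tmp")
print("/data/logs" in wh.storage, "/data/tmp" in wh.storage)  # -> True False
```

This is why external tables are preferred when the data is shared with other tools or must outlive the table definition.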

14. What is the difference between HiveServer and HiveServer2?

HiveServer2 is an optional built-in Hive service that allows remote clients to submit requests to Hive in various programming languages and retrieve results. HiveServer2 is an improved version of HiveServer1; it mainly addresses HiveServer1's inability to handle concurrent requests from multiple clients and its lack of authentication support.


Origin blog.csdn.net/qq_45059457/article/details/108873133