Big Data: Hadoop Basic Theoretical Knowledge

Basic theoretical knowledge of Hadoop

  • HDFS (Hadoop Distributed File System) was designed and developed on the basis of the Google File System (GFS) paper published by Google
  • Besides the characteristics it shares with other distributed file systems, HDFS has several distinctive features:
    High fault tolerance: hardware is assumed to be unreliable, so the system automatically detects failures and recovers lost replicas
    High throughput: provides high aggregate throughput for applications that access large amounts of data
    Large file storage: supports storing data at the TB to PB scale
  • What is HDFS suitable for?
    Storage and access of large files
    Streaming data access (see the sketch after this list)
  • What is HDFS not suitable for?
    Storing massive numbers of small files
    Random writes
    Low-latency reads
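
As a concrete illustration of the large-file, streaming access pattern above, here is a minimal Java sketch using the HDFS FileSystem API. It assumes a cluster reachable through the usual core-site.xml/hdfs-site.xml client configuration; the path /data/events.log is a made-up example.

```java
// A minimal sketch, assuming a cluster reachable via the standard
// configuration files; /data/events.log is a hypothetical path.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log"); // hypothetical path

        // Write once, sequentially -- the pattern HDFS is designed for.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("one line of event data\n".getBytes(StandardCharsets.UTF_8));
        }

        // Stream the file back from start to end.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note the shape of the access: one sequential write followed by sequential reads. The API offers append but no way to overwrite bytes in the middle of a file, which is why random writes appear on the "not suitable" list.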

Examples of HDFS application scenarios:

  • HDFS is the distributed file system in the Hadoop technical framework; it manages files spread across independent physical machines
  • It can be used in a variety of scenarios, for example:
    Website user behavior data storage
    Ecosystem data storage
    Meteorological data storage

Basic system architecture

HDFS architecture key design

HDFS high reliability (HA)

Metadata persistence

Configuring the HDFS data storage policy

  • By default, the HDFS NameNode automatically chooses the DataNodes on which to place data replicas. In real business scenarios, however, the following needs arise:
    DataNodes contain different types of storage devices, and data needs to be placed on a suitable device type for tiered storage (see the sketch after this list)
    Data in different DataNode directories differs in importance, and needs to be saved to suitable DataNodes selected by directory label
    The DataNode cluster uses heterogeneous servers, and critical data needs to be stored in a highly reliable group of nodes
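
The first of these scenarios maps onto HDFS storage policies: each DataNode volume in dfs.datanode.data.dir can be tagged with a storage type (DISK, SSD, ARCHIVE, RAM_DISK), and a policy set on a file or directory determines which types its replicas are written to. Below is a minimal Java sketch under those assumptions; the directory name is hypothetical, and the directory-label and node-group scenarios depend on distribution-specific placement policies that are not shown here.

```java
// A minimal sketch, not a definitive implementation: sets the built-in
// COLD storage policy on a directory so that replicas written under it
// land on volumes tagged ARCHIVE. /data/cold is a hypothetical directory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads core-site.xml / hdfs-site.xml
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        Path coldDir = new Path("/data/cold"); // hypothetical directory
        dfs.mkdirs(coldDir);

        // Built-in policies include HOT (all DISK), WARM (mixed DISK/ARCHIVE),
        // COLD (all ARCHIVE), ONE_SSD and ALL_SSD.
        dfs.setStoragePolicy(coldDir, "COLD");
    }
}
```

The same policy can be set from the command line with `hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD`.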

HDFS data integrity guarantee

  • The primary goal of HDFS is to keep stored data intact, so it is designed to remain reliable when individual components fail:
  • Replica reconstruction after data disk failure: when a DataNode stops sending its periodic reports to the NameNode, the NameNode initiates replica reconstruction to recover the lost replicas
  • Cluster data balancing: a data balancing mechanism ensures that data is distributed evenly across DataNodes
  • Metadata reliability guarantees:
    An operation log (editlog) mechanism records every metadata change, and the metadata is stored on both the active and standby NameNode
    A snapshot mechanism, like that of ordinary file systems, ensures that data can be restored promptly after an accidental operation (see the sketch after this list)
  • Safe mode: when DataNodes or hard disks fail, safe mode prevents the failure from spreading
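
Two of these safeguards can be driven directly from the HDFS Java API. Below is a minimal sketch, assuming admin privileges on the cluster; the /data directory and the snapshot name are hypothetical. It enables and takes a snapshot of a directory, then queries whether the NameNode is currently in safe mode.

```java
// A minimal sketch, assuming admin privileges; /data and the snapshot
// name "before-cleanup" are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class HdfsSafeguardsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Snapshots: enable them on a directory (an admin operation),
        // then take one; it can later restore accidentally deleted data.
        Path dir = new Path("/data"); // hypothetical directory
        dfs.allowSnapshot(dir);
        Path snapshot = dfs.createSnapshot(dir, "before-cleanup");
        System.out.println("Snapshot created at: " + snapshot);

        // Safe mode: while the NameNode is in safe mode the namespace is
        // read-only, which keeps a failure from spreading through writes.
        boolean inSafeMode = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);
        System.out.println("NameNode in safe mode: " + inSafeMode);
    }
}
```

The command-line equivalents are `hdfs dfsadmin -allowSnapshot /data`, `hdfs dfs -createSnapshot /data before-cleanup`, and `hdfs dfsadmin -safemode get`.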

Common shell commands

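Day-to-day file operations use the `hdfs dfs` command family (for example, `hdfs dfs -ls /`). To stay in the same Java vein as the sketches above, this snippet drives a few common commands through Hadoop's FsShell class, the same code path the CLI uses; each call's command-line form is shown in the comment above it, and all paths are illustrative.

```java
// A minimal sketch, assuming a configured client; all paths are
// illustrative. FsShell is the class behind the `hdfs dfs` CLI.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class CommonCommandsExample {
    public static void main(String[] args) throws Exception {
        FsShell shell = new FsShell(new Configuration());

        // hdfs dfs -mkdir -p /user/alice
        ToolRunner.run(shell, new String[] {"-mkdir", "-p", "/user/alice"});
        // hdfs dfs -put localfile.txt /user/alice
        ToolRunner.run(shell, new String[] {"-put", "localfile.txt", "/user/alice"});
        // hdfs dfs -ls /user/alice
        ToolRunner.run(shell, new String[] {"-ls", "/user/alice"});
        // hdfs dfs -cat /user/alice/localfile.txt
        ToolRunner.run(shell, new String[] {"-cat", "/user/alice/localfile.txt"});
        // hdfs dfs -rm /user/alice/localfile.txt
        ToolRunner.run(shell, new String[] {"-rm", "/user/alice/localfile.txt"});
    }
}
```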
