A first look at Hadoop and information security (a distributed storage and computing platform for big data)

Table of contents

What is Hadoop

Features of Hadoop

Hadoop advantages

Disadvantages of Hadoop

Important components of Hadoop

Information security


What is Hadoop

Hadoop is a distributed storage and computing platform suitable for big data.

The distinction between Hadoop in the narrow sense and in the broad sense:

Hadoop in the narrow sense refers to the framework itself, which is composed of three parts: HDFS, a distributed file system (storage); MapReduce, a distributed offline computing framework (computation); and YARN, a resource scheduling framework.

Hadoop in the broad sense includes not only the Hadoop framework itself but also the auxiliary frameworks around it: Flume for log data collection; Sqoop for importing data from, and exporting data to, relational databases;

Hive, which relies heavily on the Hadoop framework to run SQL-style computations; and HBase, a database for the big-data field (roughly what MySQL is to the relational world). In the broad sense, Hadoop refers to this whole ecosystem.

Features of Hadoop

Hadoop advantages

1. Hadoop is highly reliable in its ability to store and process data.

2. Hadoop distributes data across clusters of available machines to complete storage and computing tasks. These clusters can easily be expanded to thousands of nodes, so Hadoop is highly scalable.

3. Hadoop can dynamically move data between nodes to keep every node balanced, which makes processing fast and efficient.

4. Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks, making it highly fault-tolerant.

Disadvantages of Hadoop

1. Hadoop is not suitable for low-latency data access.

2. Hadoop cannot efficiently store large numbers of small files (each file's metadata occupies NameNode memory regardless of the file's size).

3. Hadoop does not support multiple concurrent writers or arbitrary in-place modification of files.

Important components of Hadoop

1. Hadoop = HDFS (distributed file system) + MapReduce (distributed computing framework) + YARN (resource scheduling framework) + the Common module.

Hadoop HDFS (Hadoop Distributed File System): a highly reliable, high-throughput distributed file system. Its core idea is divide and conquer: files are cut into blocks, replicated, and stored across the cluster.

HDFS involves the following main roles:

      NameNode (nn): stores file metadata, such as the file name, directory structure, and file attributes (creation time, number of replicas, permissions), as well as each file's block list and the DataNodes on which each block resides.

      SecondaryNameNode (2nn): assists the NameNode; an auxiliary background process that monitors the status of HDFS and takes snapshots of the HDFS metadata at regular intervals.

      DataNode (dn): stores file block data in the local file system and verifies block checksums.

      Note: NN, 2NN, and DN are role names as well as process names, and they are also used to refer to the computer nodes where those processes run!
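To make the storage side concrete, here is a minimal sketch of writing and then reading a file through the HDFS Java API. The NameNode address and the /demo path are placeholders for illustration, not values from this article:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point this at your own NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");

            // Write: the client asks the NameNode for metadata,
            // then streams block data to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy it to stdout.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```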

2. Hadoop MapReduce: a distributed offline parallel computing framework

     Decompose the task, distribute the processing, and consolidate the results.

     A MapReduce computation = Map phase + Reduce phase.

     The Map phase is the "divide" step: it processes the input data in parallel.

     The Reduce phase is the "combine" step: it aggregates the results of the Map phase.
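The classic example of this divide-and-combine split is word counting. Below is a condensed sketch of the standard WordCount Mapper and Reducer using the org.apache.hadoop.mapreduce API (job submission wiring is omitted for brevity):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map ("divide"): each input line is split into words,
    // and each word is emitted as the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce ("combine"): all counts for the same word arrive
    // together and are summed into the final total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Between the two phases, the framework's shuffle step groups all (word, 1) pairs by key, which is what lets the Reducer see every count for a given word at once.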

3. Hadoop YARN: a framework for job scheduling and cluster resource management

Computing resource coordination

YARN has the following main roles. As with HDFS, these are role names as well as process names, and they are also used to refer to the computer nodes where they run.

  • ResourceManager (rm): handles client requests, starts/monitors ApplicationMasters, monitors NodeManagers, and performs resource allocation and scheduling;
  • NodeManager (nm): manages resources on a single node and processes commands from the ResourceManager and from the ApplicationMaster;
  • ApplicationMaster (am): splits the input data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance;
  • Container: an abstraction of the task-running environment that encapsulates multi-dimensional resources such as CPU and memory, along with task-related information such as environment variables and startup commands.
  • In short: the ResourceManager is the boss, the NodeManagers are its workers, and the ApplicationMaster is the specialist in charge of one computing job.
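As a small sketch of how these roles look from the client side (assuming a reachable cluster configured via yarn-site.xml on the classpath; this is illustrative, not from the original article), the YarnClient API can ask the ResourceManager for its NodeManagers and running applications:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Ask the ResourceManager which NodeManagers are alive.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println("NodeManager: " + node.getNodeId()
                    + " capability=" + node.getCapability());
        }

        // List the applications the ResourceManager currently knows about.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println("App: " + app.getApplicationId()
                    + " state=" + app.getYarnApplicationState());
        }

        yarn.stop();
    }
}
```

Submitting a job runs the other way: the client asks the ResourceManager for a Container in which to launch the ApplicationMaster, which then requests further Containers for its tasks.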

4. Hadoop Common: utility modules that support the other modules (Configuration, RPC, the serialization mechanism, and logging).
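Two of these Common facilities are easy to demonstrate: Configuration, which loads settings from the *-site.xml files, and Writable, Hadoop's compact serialization contract used throughout MapReduce and RPC. A minimal sketch follows; the PageVisit type and its fields are invented for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class CommonDemo {
    // A custom key/value type: Writable is Hadoop Common's
    // serialization mechanism for keys, values, and RPC payloads.
    public static class PageVisit implements Writable {
        private String url;    // illustrative field
        private long visits;   // illustrative field

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(visits);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            visits = in.readLong();
        }
    }

    public static void main(String[] args) {
        // Configuration loads core-site.xml etc. from the classpath;
        // get() falls back to the supplied default when a key is unset.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
    }
}
```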

Information security

Drawing on experience from Hadoop security deployments, the following ten recommendations have been developed to keep data secure in large-scale, complex, and diverse environments.

1. Start early: determine the data privacy protection strategy during the planning and deployment stage, ideally before any data is put into Hadoop.

2. Determine which data is sensitive for the enterprise, based on the company's privacy protection policy as well as relevant industry and government regulations.

3. Detect promptly whether sensitive data has been exposed, or has been imported into Hadoop.

4. Gather information and assess whether the data is exposed to security risks.

5. Determine whether business analysis needs access to the real data or whether masked data will suffice, and then choose the appropriate encryption technology. When in doubt, encrypt and mask the data, while providing the strongest available encryption and flexible response strategies that can adapt to future needs.

6. Make sure the data protection solution supports both masking and encryption techniques, especially if sensitive data needs to be kept isolated within Hadoop.

7. Ensure the data protection scheme is applied to all data files, so that the accuracy of data analysis is preserved when data is aggregated.

8. Determine whether protection needs to be tailored to specific data sets, and consider dividing Hadoop directories into smaller, more tightly secured groups.

9. Ensure that the chosen encryption solution interoperates with the company’s access control technology, allowing different users to selectively access data in the Hadoop cluster.

10. Ensure that when encryption is required, the appropriate technologies (such as Java, Pig, etc.) can be deployed, with support for seamless decryption and fast access to the data.
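As a toy illustration of recommendation 10, the sketch below encrypts a single sensitive field with the standard javax.crypto API before it would be ingested into Hadoop. The AES-GCM choice, the inline key generation, and the sample value are assumptions for demonstration, not a vetted production design:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class FieldEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative only: a real deployment would fetch the key
        // from a key management service, not generate it inline.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        // Fresh random IV per encryption, as AES-GCM requires.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        // Hypothetical sensitive field to protect before ingestion.
        byte[] ciphertext =
                cipher.doFinal("4111-1111-1111-1111".getBytes(StandardCharsets.UTF_8));
        System.out.println("encrypted field: "
                + Base64.getEncoder().encodeToString(ciphertext));
    }
}
```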


Origin: blog.csdn.net/u012206617/article/details/133141941