Hadoop Summary (Part 1)

Recently I have been studying distributed training and storage for large models. My own background in distributed systems is relatively weak, and much of the infrastructure behind deep learning builds on earlier distributed systems, so I summarized the classic distributed solution for big data: Hadoop.

Why Hadoop

The role of Hadoop is simple: it provides a unified and stable storage and computing environment on a cluster of machines, and serves as a platform for other distributed application services.

To some extent, Hadoop organizes multiple computers into a single computer (all doing the same job): HDFS is that computer's hard disk, and MapReduce is its CPU and controller.

Trouble

Since Hadoop is software designed for clusters, learning to use it inevitably means configuring it on multiple computers. This creates two main obstacles for learners:

  1. Expensive computer clusters. A cluster environment composed of multiple computers requires expensive hardware.
  2. Difficult to deploy and maintain. Deploying the same software environment on many computers is a lot of work, and redeploying after the environment changes is inflexible and difficult.

To solve these problems, we have a very mature tool: Docker.

Docker is a container management system that can run multiple "virtual machines" (containers) much like a hypervisor does, and organize them into a cluster. A true virtual machine fully virtualizes a computer, which consumes a lot of hardware resources and is inefficient. Docker instead provides only an isolated, reproducible operating environment: all processes in a container are actually executed by the host kernel, so they run at nearly the same efficiency as native host processes (close to 100%).

Overall design of Hadoop

The Hadoop framework is a framework for processing big data on computer clusters, so it must be software that can be deployed on multiple computers. The hosts on which Hadoop is deployed communicate with each other through sockets (over the network).

Hadoop mainly includes two components: HDFS and MapReduce. HDFS is responsible for distributing and storing data, while MapReduce is responsible for mapping and processing the data and summarizing the results.

The most fundamental principle of the Hadoop framework is to use a large number of computers working in parallel to speed up the processing of huge amounts of data. For example, if a search engine company wants to filter and summarize hot search terms from trillions of non-standardized records, it needs to organize a large number of computers into a cluster to process that information. Using a traditional database would take enormous time and storage, a scale that is difficult for any single computer to handle. The main difficulty lies in integrating a large amount of hardware into one high-speed machine, and even if that were achieved, the maintenance cost would be prohibitive.

Hadoop can run on as many as a few thousand inexpensive mass-produced computers organized as a computer cluster.

A Hadoop cluster can store data and distribute processing tasks efficiently, which brings many benefits. First, it reduces the cost of building and maintaining the computers. Second, a hardware failure on any one computer does not have a fatal impact on the whole system, because a framework developed for the application layer of a cluster must assume that computers will fail.

HDFS

Hadoop Distributed File System, HDFS for short.

HDFS is used to store files across the cluster. Its core design follows the ideas of Google's GFS, and it can store very large files.

In a server cluster, file storage needs to be both efficient and stable, and HDFS achieves both at the same time.

HDFS achieves efficient storage by letting the computers in the cluster handle requests independently. When a user (usually a back-end program) sends a data storage request, the server that responds is often busy processing other requests, which is the main reason service becomes slow. But if that responding server simply assigns a data server to the user, and the user then interacts with the data server directly, things become much faster.

Stable data storage is usually achieved by keeping extra copies of the data, and HDFS does the same. The storage unit of HDFS is the block (Block): a file may be divided into multiple blocks stored on physical storage. HDFS therefore replicates each data block n times, according to the configured replication factor, and stores the copies on different DataNodes (servers that store data), so that the failure of a single data node does not lose data.
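
To make the block-and-replica idea concrete, here is a minimal Java sketch. It assumes the same Hadoop Java client shown in the HDFS API section later in this article, a NameNode reachable at 172.17.0.2:9000, and a hypothetical file /hello/hello.txt; it asks the NameNode for the file's replication factor and prints which DataNodes hold each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://172.17.0.2:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/hello/hello.txt");            // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());

        // Each BlockLocation describes one block and the DataNodes holding its replicas
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}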

HDFS nodes

HDFS runs on many different computers, some dedicated to storing data, and some dedicated to directing other computers to store data. The "computer" mentioned here can be called a node in the cluster.

Name Node (NameNode)

The NameNode is the node that directs the storage performed by other nodes. Any file system (File System, FS) needs to be able to map a file path to the file itself. The NameNode is the computer that stores this mapping information and provides the mapping service, acting as the "administrator" of the entire HDFS system, so there is only one NameNode in an HDFS cluster.

Data Node (DataNode)

A DataNode is a node used to store data blocks. When a file is accepted by the NameNode and divided into blocks, the blocks are stored on the assigned DataNodes. DataNodes store data and serve read and write requests. The stored blocks are similar in concept to the "sectors" of a hard disk and are the basic unit of HDFS storage.

Secondary NameNode

The Secondary NameNode is the NameNode's "secretary". This description is quite apt, because it does not take over the NameNode's work, regardless of whether the NameNode is able to keep working. It is mainly responsible for offloading the NameNode: it backs up the NameNode's state and performs some administrative work if the NameNode asks it to. If the NameNode goes down, it can also provide backup data to help restore the NameNode. There can be multiple Secondary NameNodes.

MapReduce

The meaning of MapReduce is as obvious as its name: Map and Reduce (mapping and reduction).

big data processing

Processing large amounts of data is a classic case of "simple in principle, complex in implementation". The implementation is complex mainly because hardware resources (mainly memory) become insufficient when large amounts of data are processed with traditional methods.

Suppose we have a piece of text (in a real environment this string could be 1 PB or longer), and we perform a simple character count, that is, count how many times each character appears in the text:

AABABCABCDABCDE

The result after statistics should be:
A 5
B 4
C 3
D 2
E 1
The counting process itself is very simple: each time a character is read, check whether it already exists in the table; if not, add a record with the value 1, and if it does, increase the record's value by 1.
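
A minimal single-machine sketch of this counting in Java (the CharCount class and its countChars helper are illustrative, not part of Hadoop):

import java.util.Map;
import java.util.TreeMap;

public class CharCount {
    // Count how often each character appears in the text
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // add a record with value 1, or increase it by 1
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countChars("AABABCABCDABCDE")); // {A=5, B=4, C=3, D=2, E=1}
    }
}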

But if we change the object of the count from "characters" to "words", the sample size instantly becomes enormous, so much so that a single computer may be unable to count the words used by billions of users over a year.

In this case there is still a way to get the work done: first divide the sample into sections that a single computer can handle, then count each section, and after each round of counting, reduce the mapped results, that is, combine the partial results into a larger intermediate result, until the large-scale reduction is finally complete.

In the case above, the first stage is "mapping": classifying and sorting the data, which already gives a result much smaller than the source data. The second stage, also done by the cluster, summarizes the sorted data as a whole; after all, the mapping results from different nodes may contain overlapping categories. In this stage the mapped results are further reduced into the final, usable statistics.
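
A minimal sketch of the reduce step in Java, assuming two machines have already produced partial counts for the two halves of the sample string above (the partial tables are hard-coded here purely for illustration):

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // Merge the partial counts produced by the mapping step into one overall result
    static Map<Character, Integer> reduce(List<Map<Character, Integer>> partials) {
        Map<Character, Integer> total = new TreeMap<>();
        for (Map<Character, Integer> partial : partials) {
            partial.forEach((c, n) -> total.merge(c, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Partial counts as if two machines each counted one half of "AABABCABCDABCDE"
        Map<Character, Integer> half1 = Map.of('A', 4, 'B', 3, 'C', 1);                 // "AABABCAB"
        Map<Character, Integer> half2 = Map.of('A', 1, 'B', 1, 'C', 2, 'D', 2, 'E', 1); // "CDABCDE"
        System.out.println(reduce(List.of(half1, half2))); // {A=5, B=4, C=3, D=2, E=1}
    }
}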

MapReduce concepts

Example:

Suppose there are 5 files, each with two columns recording the name of a city and the corresponding temperature recorded in that city on different measurement dates. The city name is the key (Key), and the temperature is the value (Value). For example: (Xiamen, 20). Now we want to find the maximum temperature for each city in all the data (note that the same city may appear in each file).

Using the MapReduce framework, we can break this down into 5 map tasks, where each task is responsible for processing one of the five files. Each map task examines each piece of data in the file and returns the maximum temperature for each city in the file.

For example, for the following data:

City       Temperature
Xiamen     12
Shanghai   34
Xiamen     20
Shanghai   15
Beijing    14
Beijing    16
Xiamen     24
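
For this sample, the map task would return (Xiamen, 24), (Shanghai, 34) and (Beijing, 16); the reduce stage then merges the per-file maxima into one maximum per city. Below is a minimal sketch of such a job using Hadoop's Java MapReduce API (the class names and input/output paths are illustrative, not prescribed by the tutorial):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Map: each input line "City Temperature" becomes a (city, temperature) pair
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
            }
        }
    }

    // Reduce: for each city, keep only the maximum of all reported temperatures
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setCombinerClass(MaxReducer.class);   // the reducer can also serve as a per-file combiner here
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/temperature/input"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/temperature/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}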

You can also think of MapReduce as a census: the census bureau sends several investigators to each city, each census taker counts a portion of that city's population, and the results are sent back to the capital. In the capital, the counts for each city are reduced to a single number (the population of that city), and from these the total population of the country can be determined. The mapping of people to cities is done in parallel, and the results are then combined (reduced). This is much more efficient than sending one person to count everyone in the country sequentially.

Hadoop's three modes: stand-alone mode, pseudo-cluster mode and cluster mode
  • Stand-alone mode: Hadoop only exists as a library, which can execute MapReduce tasks on a single computer, and is only used for developers to build a learning and experiment environment.
  • Pseudo-cluster mode: In this mode, Hadoop will run on a single machine in the form of a daemon process, which is generally used for developers to build a learning and experiment environment.
  • Cluster mode: This mode is the production environment mode of Hadoop, that is to say, this is the mode that Hadoop really uses to provide production-level services.

HDFS configuration and startup

Like a database, HDFS is started as a daemon process. To use HDFS, you use an HDFS client to connect to the HDFS server over the network (via sockets) and access the file system through it.

Prepare the basic Hadoop environment in a container named hadoop_single, then start the container and enter it.

Once inside the container, verify that Hadoop exists:

hadoop version

Hadoop exists if the result shows the Hadoop version number.

Next we will move to the formal steps.

Create a new hadoop user

Create a new user named hadoop:

adduser hadoop

Install a small tool for modifying user passwords and rights management:

yum install -y passwd sudo

Set hadoop user password:

passwd hadoop

Enter the password twice when prompted, and be sure to remember it!

Modify the owner of the hadoop installation directory to be the hadoop user:

chown -R hadoop /usr/local/hadoop

Then edit the /etc/sudoers file with a text editor. After the line

root    ALL=(ALL)       ALL

add the following line:

hadoop  ALL=(ALL)       ALL

Then exit the container.

Stop the container hadoop_single and commit it to an image named hadoop_proto:

docker stop hadoop_single
docker commit hadoop_single hadoop_proto

Create new container hdfs_single:

docker run -d --name=hdfs_single --privileged hadoop_proto /usr/sbin/init

This way the new user is created.

Start HDFS

Now enter the newly created container:

docker exec -it hdfs_single su hadoop

You should now be the hadoop user:

whoami

It should show "hadoop".

Generate SSH keys:

ssh-keygen -t rsa

Here you can keep pressing Enter until the generation ends.

Then add the generated key to the trust list:

ssh-copy-id hadoop@172.17.0.2

View the container IP address:

ip addr | grep 172

This tells you that the container's IP address is 172.17.0.2; yours may be different.

Before starting HDFS, we make some simple configurations. All Hadoop configuration files are stored in the etc/hadoop subdirectory under the installation directory, so we can enter this directory:

cd $HADOOP_HOME/etc/hadoop

Here we modify two files: core-site.xml and hdfs-site.xml

In core-site.xml, we add the following property under the <configuration> tag:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<your IP>:9000</value>
</property>

Add the following property under the <configuration> tag in hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

Format file structure:

hdfs namenode -format

Then start HDFS:

start-dfs.sh

The startup is divided into three steps, starting the NameNode, DataNode and Secondary NameNode respectively.

Run jps to view the running Java processes.

At this point, the HDFS daemons are up and running. Since HDFS comes with an HTTP panel, we can visit http://<your container IP>:9870/ in a browser to view the HDFS panel and its detailed information.

Using HDFS

HDFS Shell

Back in the hdfs_single container, the following commands can be used to operate HDFS:

# List the files and subdirectories under the root directory /, absolute path
hadoop fs -ls /
# Create a new directory, absolute path
hadoop fs -mkdir /hello
# Upload a file
hadoop fs -put hello.txt /hello/
# Download a file
hadoop fs -get /hello/hello.txt
# Print the contents of a file
hadoop fs -cat /hello/hello.txt

These are the most basic HDFS commands; many other operations familiar from traditional file systems are also supported.

HDFS API

HDFS is supported by many back-end platforms. The official distribution currently includes C/C++ and Java programming interfaces, and the package managers for Node.js and Python also provide importable HDFS clients.

Here are the dependency declarations for each package manager:

Maven:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.4</version>
</dependency>

Gradle:

providedCompile group: 'org.apache.hadoop', name: 'hadoop-hdfs-client', version: '3.1.4'

NPM:

npm i webhdfs 

pip:

pip install hdfs

Example of connecting to HDFS from Java (modify the IP address as needed):

package com.zain;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class Application {
    public static void main(String[] args) {
        try {
            // Configure the connection address of the NameNode
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://172.17.0.2:9000");
            FileSystem fs = FileSystem.get(conf);
            // Open the file and print its contents
            Path hello = new Path("/hello/hello.txt");
            FSDataInputStream ins = fs.open(hello);
            int ch = ins.read();
            while (ch != -1) {
                System.out.print((char) ch);
                ch = ins.read();
            }
            System.out.println();
            // Release the stream and the file system handle
            ins.close();
            fs.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

Origin blog.csdn.net/weixin_44659309/article/details/132382328