Hadoop (1)--hadoop history, hadoop storage model, architecture model, read and write process, pseudo-distributed installation

01 Single-machine big data processing

Scenario 1: There is a 1 TB text file stored line by line, and exactly two lines in it are identical. Find these two lines.

Step 1: Traverse the file line by line and compute each line's hashcode. Write every line into a small file named after its hashcode, so the file name is the hashcode and the file content is the line text. Identical lines have the same hashcode, so they always land in the same file.

Step 2: Traverse the resulting files. Whenever a file has received two identical lines, those are the two lines we are looking for; comparisons only need to happen within each small file.

Multiple servers could be used to optimize this, but how do we optimize when there is only one server and not enough memory?

Before scaling from one server to many, ask first: is it really necessary to generate so many files?

Each small file should be small enough to load into memory at once. Naming files directly after the hashcode is only the first step; instead, take the hashcode modulo a fixed number and use the remainder as the file name. This keeps the total number of files under control while still guaranteeing that two identical lines end up in the same file.
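As a rough illustration of this bucketing idea, here is a minimal Java sketch; the bucket count, file layout, and class name are all made up for the example, and the per-line file append is kept naive for clarity.

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class DuplicateLineFinder {
    static final int NUM_BUCKETS = 1024; // chosen so that each bucket fits in memory

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path workDir = Files.createTempDirectory("buckets");

        // Pass 1: route each line to the bucket file hash(line) % NUM_BUCKETS
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int bucket = Math.floorMod(line.hashCode(), NUM_BUCKETS);
                Path bucketFile = workDir.resolve(String.valueOf(bucket));
                Files.write(bucketFile, Collections.singletonList(line),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }

        // Pass 2: each bucket is small, so load it and look for a repeated line
        for (int b = 0; b < NUM_BUCKETS; b++) {
            Path bucketFile = workDir.resolve(String.valueOf(b));
            if (!Files.exists(bucketFile)) continue;
            Set<String> seen = new HashSet<>();
            for (String line : Files.readAllLines(bucketFile)) {
                if (!seen.add(line)) {
                    System.out.println("Duplicate line: " + line);
                    return;
                }
            }
        }
    }
}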

Scenario 2: Fully sort a huge numeric file in ascending order.

– Idea 1: Pre-create a set of files and assign each file a value range, then route every number to the file covering its range. The resulting files are ordered between ranges but unordered internally.
– Idea 2: Repeatedly read a batch of data that fits in memory, sort it, and write it out. These small files are ordered internally but their ranges overlap, so a merge (merge sort) pass is then used to combine them, at the cost of a second round of I/O.
These are the basic single-machine approaches to big data.
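A minimal Java sketch of idea 2 (sort chunks in memory, then k-way merge them), under the assumption that the input holds one number per line; the chunk size, file names, and class name are illustrative.

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    static final int CHUNK_SIZE = 1_000_000; // numbers per in-memory chunk

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        List<Path> chunks = new ArrayList<>();

        // Phase 1: read chunks that fit in memory, sort each one, spill it to disk
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<Long> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(Long.parseLong(line.trim()));
                if (buffer.size() == CHUNK_SIZE) {
                    chunks.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(spill(buffer));
        }

        // Phase 2: k-way merge of internally ordered chunks with overlapping ranges
        List<BufferedReader> readers = new ArrayList<>();
        for (Path chunk : chunks) readers.add(Files.newBufferedReader(chunk));
        PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[0]));
        for (int i = 0; i < readers.size(); i++) {
            String line = readers.get(i).readLine();
            if (line != null) heap.add(new long[]{Long.parseLong(line), i});
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                long[] top = heap.poll();
                writer.write(Long.toString(top[0]));
                writer.newLine();
                String next = readers.get((int) top[1]).readLine();
                if (next != null) heap.add(new long[]{Long.parseLong(next), (int) top[1]});
            }
        }
        for (BufferedReader r : readers) r.close();
    }

    // sort one chunk in memory and write it to a temporary file
    private static Path spill(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(chunk)) {
            for (long v : buffer) { w.write(Long.toString(v)); w.newLine(); }
        }
        return chunk;
    }
}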

02 Cluster

  1. Cut the file into pieces;
  2. Process the pieces in parallel.

Each server does the same thing as in the single-machine approach: it computes the hashcode of each line and takes the modulus.

When two identical lines end up on different servers, data migration is needed. The purpose of migration is to bring the two identical lines onto the same server; the principle is to let each server process the bucket files with the same number (remainder).

The key point is that each server handles only a small share of the data, and they all work at the same time.

Problem 1: data migration costs network I/O; Problem 2: cutting the complete file takes time.

Since the only purpose of moving data is to compute on it, a better idea is to move the program to where the data is instead of moving the data.

These details are exactly what a distributed cluster framework has to take care of.

HDFS answers the question of how files are cut into blocks and stored.

03 Storage and Architecture Model

Hadoop-HDFS

The role of HDFS is to store big data across a cluster. Distributed storage means a complete file is cut into blocks and the blocks are scattered across the nodes of the server cluster. Each block can have replicas; all blocks of the same file are the same size, while different files may use different block sizes.
A complete file is cut into small files (blocks) and stored on different server nodes. The storage follows certain rules, and these rules are the basis of the storage model's design. At the bottom, the computer reads everything as 0s and 1s, so think of the file as a byte array: divide the byte array into blocks and place them on different servers.

Assuming that the file is 100 bytes, it must be divided into different blocks and placed on different servers.

Each piece is called a block, and the blocks are scattered (hashed) across the nodes. With 5 servers and 10 blocks, each server can hold 2 blocks; how many blocks land on each server depends on the number of servers. The other goal is load balancing: in the end each server should hold a similar number of blocks.

What is the cutting based on? Bytes. The file is cut strictly by byte count, regardless of whether a record is complete; incomplete records are stitched back together later during processing.

Offset: the starting byte position of each block within the original file. The offset serves as an index for locating data.

If a file's block size is set to 10, then every block is 10 bytes except possibly the last one, which may be smaller. All blocks of the same file must use the same size; if they were inconsistent, locating data by offset would no longer be a simple calculation and processing the file would become harder.
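A tiny illustration of why a fixed block size makes offset positioning a simple calculation (the numbers are only examples):

public class BlockLocator {
    // With a fixed block size, finding the block that holds a given byte offset
    // is simple integer arithmetic; no per-file lookup table is needed.
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB, the HDFS default
        long fileLength = 300L * 1024 * 1024;  // a 300 MB file -> 3 blocks (128 + 128 + 44 MB)

        long byteOffset = 200L * 1024 * 1024;  // we want the byte at the 200 MB position
        long blockIndex = byteOffset / blockSize;              // -> block 1 (the second block)
        long offsetInBlock = byteOffset % blockSize;           // -> 72 MB into that block
        long totalBlocks = (fileLength + blockSize - 1) / blockSize; // ceiling division -> 3

        System.out.printf("block %d, offset-in-block %d bytes, %d blocks total%n",
                blockIndex, offsetInBlock, totalBlocks);
    }
}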

Set the number of replicas: if a file is cut into 10 blocks and the replication factor is 3 (the default), the actual number of stored blocks is 30. After cutting, each block gets 3 identical copies (the original plus 2 replicas), placed on 3 different server nodes. The purpose is data safety: if a node fails, the data can be restored from the copies on other servers.

Cutting the file into blocks and scattering them is what enables horizontal scaling; replication is what keeps the data safe.

If the number of replicas exceeds the number of nodes, two replicas end up on the same node, which is meaningless.

The default block size is 128 MB and the default replication factor is 3. The block size determines how long the program that is moved to a block takes to run over it.

If a block is not accessed by many jobs, the default replication is enough. If a block will be accessed frequently, increase its replication so that all the jobs do not pile onto one copy; a job can then run against the same block on another server.

The block size is specified once at upload time, and the content of uploaded blocks cannot be modified afterwards. HDFS focuses on distributed computing rather than on editing stored blocks: write once, read many times.

Architecture model: master-slave architecture

The nodes are organized into a structure through which the data is maintained.

Master-slave architecture: the master node manages the file metadata (MetaData), and the slave nodes store and process the actual file data, that is, the block files.

The master role maintains the integrity of the metadata, while each slave role maintains its own node's resource information and actually handles the data blocks placed on that node.

Master node: NameNode; slave node: DataNode.

Metadata includes the block size, block offsets, and permissions. In other words, metadata is the attribute information about how a file was cut into blocks, and it is what the master maintains.

To store and find data quickly, appoint a manager to keep records: a list that tells where each file's blocks are stored.

The NameNode's job is to maintain the metadata of all block files stored in the cluster. Each slave node runs a DataNode role, and each DataNode is only responsible for maintaining and managing the block files placed on its own node.

The NameNode and DataNodes stay closely connected: they keep a heartbeat, pinging at a fixed interval. The NameNode needs to know which blocks each DataNode manages, so each DataNode provides its block list to the NameNode (a block report).

Why does the DataNode report rather than the NameNode ask? Because the report can ride on the heartbeat that is already being kept, completing both jobs at once.
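As a conceptual sketch only (this is not Hadoop's real heartbeat protocol or its classes), the idea looks roughly like this: the DataNode side periodically tells the NameNode side that it is alive and which blocks it holds.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class HeartbeatSketch {
    interface NameNodeService {
        void heartbeat(String dataNodeId);                           // "I am alive"
        void blockReport(String dataNodeId, List<String> blockIds);  // "these are my blocks"
    }

    static class DataNodeAgent {
        private final String id;
        private final NameNodeService nameNode;
        private final List<String> localBlocks;
        private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

        DataNodeAgent(String id, NameNodeService nn, List<String> blocks) {
            this.id = id; this.nameNode = nn; this.localBlocks = blocks;
        }

        void start() {
            // frequent heartbeat (3 s mirrors HDFS's default interval)
            scheduler.scheduleAtFixedRate(() -> nameNode.heartbeat(id), 0, 3, TimeUnit.SECONDS);
            // full block report, much rarer in real HDFS; shortened here for the demo
            scheduler.scheduleAtFixedRate(() -> nameNode.blockReport(id, localBlocks),
                    0, 60, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) {
        NameNodeService nn = new NameNodeService() {
            public void heartbeat(String id) { System.out.println(id + " heartbeat"); }
            public void blockReport(String id, List<String> blocks) {
                System.out.println(id + " reports blocks " + blocks);
            }
        };
        new DataNodeAgent("datanode-1", nn, Arrays.asList("blk_0001", "blk_0002")).start();
    }
}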

HdfsClient: the client's way into the cluster. Whether accessing the cluster or storing data into it, the object the client first talks to is the NameNode.

On each DataNode, blocks are stored with the help of the local file system, placed under a configured directory on that local server.

04 Persistence

The client learns from the NameNode which nodes hold the data, and can then interact directly with those nodes.

HDFS architecture:


HDFS client: to issue any instruction against the HDFS storage architecture, the client first interacts with the NameNode, because the NameNode holds the metadata. In the architecture diagram the NameNode and DataNodes are connected by dotted lines, meaning the master and slave nodes keep communicating: the NameNode tracks the data on every DataNode as well as the overall resource information.
After interacting with the NameNode, the client talks to the DataNodes directly.
The Secondary NameNode matters much less from version 2.0 onward. At the bottom layer, each block is stored as three copies by default.
Suppose a 50 GB file is placed into the cluster below. A single node may not be able to hold all 50 GB, so the first step is to cut the file into a pile of 64 MB blocks; the replicas of each block are placed on the nodes in that block's node list.

In-memory storage and persistence:
Why is persistence needed? It has a lot to do with the NameNode's storage model.
The NameNode is based on in-memory storage: it keeps the metadata in memory only and does not exchange data with the disk while serving requests.

Under what circumstances does such an exchange happen? Relational databases constantly exchange data between memory and disk in both directions: a SQL statement loads table data from disk into memory, possibly modifies it there, and then writes it back to the table. The exchange is also needed when too much data is loaded into memory: once memory is full, some data is written out to disk to free space for new data.
If data that was written out to disk is later queried by a client, memory and disk have to keep interacting.

Hadoop handles metadata differently: it does not interact with the disk for reads. All metadata processing happens in memory, for speed (disk access and memory access differ by orders of magnitude).

All DataNode block information is handed to the NameNode to maintain, and it must be processed quickly, so it is a purely in-memory operation. The risk is that memory-only data is easily lost on power failure, which is exactly why a persistence mechanism is needed.

Persistence: writing the in-memory data to disk for permanent preservation. (It is one-way: memory to disk.)

The persisted copy only comes into play when data needs to be restored.
Why are block locations not persisted? Because a persisted location could be stale: if a block on some node were broken or lost, the persisted file would give wrong information.
Therefore the DataNodes actively report their block locations to the NameNode at runtime instead.

Summary: block location information is not written into the persisted disk files.
The role and characteristics of NameNode persistence:

The NameNode performs persistence in more than one way.
Persistence means the metadata in memory is permanently stored on disk in the form of files.
fsimage is a mirror snapshot: it captures a state of the NameNode's in-memory metadata and saves it into a disk file, so the state survives a shutdown. What is the nature of storing it as a file? Serialization: converting in-memory objects into a binary file. The advantage is compatibility across platforms and nodes. Deserialization turns the fsimage binary back into in-memory objects. Recovery from the snapshot is fast, but producing the snapshot (serialization) is slow.
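A generic Java serialization sketch to illustrate the snapshot idea; this is not HDFS's actual fsimage format, and the Metadata class is invented for the example.

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class SnapshotDemo {
    // A tiny "metadata" object snapshotted to disk and restored:
    // the same write-slow / restore-fast trade-off described for fsimage.
    static class Metadata implements Serializable {
        private static final long serialVersionUID = 1L;
        Map<String, Long> blockOffsets = new HashMap<>();
    }

    public static void main(String[] args) throws Exception {
        Metadata meta = new Metadata();
        meta.blockOffsets.put("blk_0001", 0L);
        meta.blockOffsets.put("blk_0002", 134217728L); // second block starts at 128 MB

        // "serialize": dump the whole in-memory state into one image file
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("image.snapshot"))) {
            out.writeObject(meta);
        }

        // "deserialize": restoring is a single read, no operations need to be replayed
        try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream("image.snapshot"))) {
            Metadata restored = (Metadata) in.readObject();
            System.out.println(restored.blockOffsets);
        }
    }
}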

edits is an edit log: every write command the client issues to the master node is appended to an operation log file. How is recovery done from it? The log is a list of instructions, so on recovery they are all executed again, which makes recovery slower.

Summary: there are two persistence mechanisms: the fsimage mirror snapshot (slow to write, fast to recover) and the edits log (fast to write, slow to recover).

Why is the mirror snapshot called a point-in-time snapshot?
Because it is produced periodically rather than continuously. (If it were generated all the time, the NameNode would be back to constantly exchanging data between memory and disk.)

How do the two work together to complete recovery?
The idea is to look at when each file is generated: when the fsimage file is produced and when the edits log is produced.

The first fsimage is generated when the file system is formatted (that is the purpose of formatting). At that point the fsimage file is essentially empty.

When the Hadoop cluster starts, it follows its own procedure: it first reads the fsimage file and creates the edits log file (both nearly empty at first). The edits log and fsimage are then merged into a new fsimage; no matter how many times the cluster is started, the fsimage never disappears, while the edits log keeps growing between merges.

When the edits log grows too large, a new merge is performed: the fast-to-recover fsimage is used as the base, and the edits are folded into it.

During recovery, fsimage is used as the basis and the edits are merged into it. How is the merge done, and which role performs it?

This introduces the SecondaryNameNode (SNN), which is responsible for the merging.
After the service is stopped and started again, the files are read back in a fixed order: the latest fsimage is loaded first, and the remaining edits log is then replayed on top of it.

Question: why does Hadoop 1.0 use a separate secondary node to do the merging?

Because the NameNode itself already has a lot of work to do; if the merge were done there and something went wrong, the merge would fail.

Under what circumstances does the merge happen? By default there are two trigger conditions: a time interval has elapsed, or the edits log has grown past a size threshold.

Is the SecondaryNameNode a standby for the master node?
No. It only does the merging; it does not act as a backup master.

The daemon on each slave node also maintains a metadata file for its own blocks. This metadata is directly tied to the blocks generated on that node, and its main content is an md5 checksum file (each block corresponds to one md5 file).

The md5 checksum value represents the integrity of the uploaded file block. When a block needs to be downloaded, the checksum is used to verify the file.
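A generic sketch of checksum verification with MD5, assuming a block file and a previously recorded digest sitting on the local disk; the file names are illustrative and this is not the DataNode's real verification code.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class BlockChecksum {
    // Compute the MD5 digest of a file and render it as a hex string.
    static String md5Of(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(Paths.get(path)));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String blockFile = "blk_0001";                 // the stored block (illustrative name)
        String recordedDigest =
                new String(Files.readAllBytes(Paths.get("blk_0001.md5"))).trim();

        // If the recomputed digest matches the recorded one, the block is intact.
        String actual = md5Of(blockFile);
        System.out.println(actual.equals(recordedDigest) ? "block intact" : "block corrupted");
    }
}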

Why wait so long before acting?
A single node holds a lot of data, so it cannot be written off and its data deleted casually.

05 Reading and writing process

High fault tolerance: achieved through replicas on multiple nodes. When one node goes down, the data can still be read from a replica on another node. When a replica is lost, an automatic recovery mechanism kicks in: a new copy is made from a remaining good node. This replica management strategy is what gives Hadoop its fault tolerance.
Suitable for batch processing: computation is moved instead of data. Data transfer is avoided as much as possible; the computation framework is moved to where the data is and runs locally, using the block locations to decide which node to run on.

For data safety, Hadoop makes several replicas of each block and scatters them across different nodes. The replica placement strategy answers the question of how those replica nodes are chosen.

If the client happens to be running on one of the cluster's servers, the first replica is placed on that very node during upload.

The core processes of the distributed file storage system are the write process and the read process:
write process: how data is cut into blocks and uploaded to different nodes, and how that whole flow is completed;
read process: how block information is read back from the different servers.

Writing process:
Hadoop's client plays a very important role in completing these tasks.
Step 1: create a DistributedFileSystem object (in Java it is simply a class). Note when writing: files are transmitted from one node to another, and network transmission is done with IO streams. Through the distributed file system object a stream is created: an output stream (input and output are named from the program's point of view).
(Input streams and output streams, byte streams and character streams, wrapper streams and basic streams.)

When writing, the client does not contact the DataNodes directly at first; it goes through the NameNode.
The NameNode uses its metadata and file list to create the path entry that points to where the file will be stored. Each block has three copies, so uploading one block means saving three copies. When writing: 1. the client communicates only with the first of the three nodes; 2. a block is further cut into small packets and pushed along like a pipeline, and this streaming saves time. The replication acknowledgements happen between the DataNodes themselves, and the client only hands the data to the first DataNode.
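A minimal client-side sketch of the write path using the Java FileSystem API (the packet pipeline between DataNodes happens inside the library); the address hdfs://node06:9000 matches the fs.defaultFS configured later in this post, and the file path is illustrative.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Step 1: the client first talks to the NameNode through the FileSystem object
        FileSystem fs = FileSystem.get(URI.create("hdfs://node06:9000"), conf);

        // Step 2: create() asks the NameNode for a path entry and returns an output stream;
        // the data written to it is split into packets and pipelined to the DataNodes
        Path target = new Path("/user/root/hello.txt");
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}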

Reading process:
A block may exist on three nodes, but the content of the three copies is identical; the client chooses to read the nearest copy.
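The matching read-side sketch: open() obtains the block locations from the NameNode, and the returned stream reads each block from a nearby DataNode; the path is again illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://node06:9000"), conf);

        // open() fetches the block locations from the NameNode; the stream then
        // reads each block from the closest DataNode that holds a copy
        try (FSDataInputStream in = fs.open(new Path("/user/root/hello.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}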

06 Pseudo-distributed installation

The meaning of the third sentence: for example, uploading a file to the HDFS distributed file system through the client takes a certain amount of time; during the upload the file is not yet visible, and the real data file only appears once the upload has finished.

What the DataNode stores is the file blocks.

Pseudo-distributed: SN, DN, NN roles are placed on the same server.

The role of the key file (passwordless SSH): suppose there are three nodes, each running one process; each role is a process. In a fully distributed setup, starting all of those processes from one server means that one server has to script control of the other servers. That requires logging into another server from this one, which is exactly what the passwordless key file enables.

According to the official documentation, two prerequisites must be installed and set up: the Java JDK and passwordless SSH.

Whichever node is the management node is the one that generates the public key.

The configuration files specify settings such as the block size.

Add the JAVA_HOME environment variable setting: when starting the cluster, if it is not set in Hadoop's scripts, Hadoop will try to find the local Java path by default and may fail.

core-site.xml specifies the configuration of the master role process, and hdfs-site.xml holds the HDFS-side settings (replication, SecondaryNameNode, and so on).

The slaves file lists the slave nodes.

If hadoop.tmp.dir is not configured, the data is stored under the /tmp temporary directory.

The cluster's unique identification number is generated in the format phase. Formatting produces the fsimage image file and the current cluster ID.

The cluster ID is shared by all roles in the cluster: the NameNode, the DataNodes, and the SecondaryNameNode all carry the same cluster ID.

The cluster ID changes every time formatting is performed.

Question:
If the cluster has unfortunately been formatted several times, how is this problem solved?

Pseudo-distributed construction

  1. Install jdk, configure environment variables, and test;

2. Passwordless SSH (key-free login)

cd into root's home directory and look at the hidden .ssh directory.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

id_dsa.pub is the public key. Whichever node is the management node is the one that generates the public key file.

Seeing an extra authorized_keys file in ~/.ssh means the operation is complete.

Use the host name alias when operating the Hadoop cluster.


  3. Decompress and install Hadoop. The configuration files for running the cluster are in the etc directory, and the system-level executable scripts are in the sbin directory.

To be able to run the Hadoop cluster commands from any directory, you need to configure the Hadoop environment variables.

Two paths need to be appended to the PATH: Hadoop's bin and sbin directories.

  4. Configure each configuration file. When the Hadoop cluster is started, it will only access the Hadoop configuration directory under etc.

The second place where JAVA_HOME must be configured: Hadoop's own environment scripts.

First set the JAVA_HOME environment variable. When the cluster starts, it looks for the local Java by default; you need to tell Hadoop where the Java path is in a few files, otherwise it will report an error that the JVM cannot be found.

hadoop-env.sh

echo $JAVA_HOME
vim hadoop-env.sh
vim mapred-env.sh
vim yarn-env.sh
# in each of these files, set: export JAVA_HOME=<your JDK path>

These files are in the hadoop directory under Hadoop's etc directory (etc/hadoop).
Modify the configuration files.
core-site.xml specifies the configuration of the main role process, the NameNode:

vi core-site.xml

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node06:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/sxt/hadoop/local</value>
    </property>

Configure hdfs-site.xml: this holds the slave-side settings and the number of replicas. For pseudo-distributed mode there is only one DataNode, so one replica per block is enough. At this point the NameNode address and the replica count are configured, but the DataNode and SecondaryNameNode are not yet.

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

The slaves file specifies the slave nodes. In pseudo-distributed mode the master and slave roles are all on one node, so just put that host name in the file.

Configure the SecondaryNameNode in hdfs-site.xml:

    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node06:50090</value>
    </property>

Open the core-site.xml file and modify the configuration:

The purpose is to redirect the hadoop.tmp.dir path so that data is not stored under /tmp.


Format:
Formatting generates the fsimage image file.

Each Hadoop cluster gets a unique identification number when it is formatted. The format stage generates the fsimage image file and the ID of the current cluster. The cluster ID is shared by all roles in the cluster; if the IDs differ, some roles may fail to run when the cluster is started.

Formatting multiple times is not necessarily a good thing.

Format HDFS:
          hdfs namenode -format   (only format once; do not run it again when restarting the cluster)
Start the cluster:
         start-dfs.sh

Check the role processes: jps
Help: hdfs
      hdfs dfs


You can see that hadoop started successfully.

Complete file upload operation:
Uploading a file is an interaction between the client and the NameNode.

First create the target path (for example with hdfs dfs -mkdir -p); the Java equivalent is sketched below.
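The same two steps can also be done from Java through the FileSystem API; here is a hedged sketch in which the local path and HDFS directory are illustrative.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://node06:9000"), new Configuration());

        // step 1: create the target directory in HDFS
        fs.mkdirs(new Path("/user/root"));

        // step 2: upload a local file; the client cuts it into blocks and the
        // NameNode decides which DataNodes receive them
        fs.copyFromLocalFile(new Path("/root/software/some-local-file.txt"),
                             new Path("/user/root/"));
        fs.close();
    }
}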


Then upload the file (for example with hdfs dfs -put).

In what state does the file exist after it is uploaded?

  • The file is stored in blocks under the data directory of the configured storage path: here it was cut into two blocks, with a block size of 128 MB (the last block may be smaller).



Origin blog.csdn.net/qq_29027865/article/details/109913436